Archived view of tilde.pink › ~ssb22 › webcheck.gmi, captured on 2022-03-01 at 15:27:07.
WebCheck is a program to check Web pages for changes to specific phrases of text. Some web monitoring programs and "watchlist" facilities will tell you when *any* change is made to a page, but that's of limited use when you are interested in only a few specific phrases, especially when these are surrounded by many other items that change far more frequently than the one that actually interests you. So WebCheck lets you check for changes to a particular item on a page.
Note that this is not a "foolproof" method. If a page lists "old news", or otherwise incorporates an old version of the item you're monitoring, WebCheck might fail to spot the new situation. You have to use your judgement about when this program can reasonably be used.
WebCheck runs from the command line, usually from a cron job or similar, and writes any changes it finds to standard output, which can then be emailed or otherwise forwarded (if using ImapFix, try its --maybenote option).
The list of sites to check is in a text file called webcheck.list. Each line (apart from blank lines and comments) specifies a URL to fetch and some text to check, optionally followed by a comment (which starts with a # after a space). For example:
http://nice-program.example.com The latest version is 1.0
or
http://nice-program.example.com The latest version is 1.0 # otherwise we'd better upgrade
If the text starts with a * then the rest of it is treated as a regular expression, otherwise it is treated as a simple search.
You can check for the *absence* of certain text by prepending a ! to it:
http://wiki-page.example.org !spam
By default, the searches are made against the text on the page, not against its source code. If you want to check the source code, prepend a > to the text or !text.
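The prefix rules above can be sketched in Python as follows. This is only an illustration and is not taken from webcheck.py: the names Check, parse_check_line and check_passes are hypothetical, and the order in which the prefixes are stripped (> first, then !, then *) is an assumption. It also ignores the other kinds of line described further down.

```python
import re
from collections import namedtuple

# Hypothetical record type for this sketch; not taken from webcheck.py.
Check = namedtuple("Check", "url text negate on_source is_regex comment")

def parse_check_line(line):
    """Split one webcheck.list-style line into its parts (illustrative only)."""
    # A comment starts with '#' after a space.
    body, _, comment = line.partition(" #")
    url, _, text = body.strip().partition(" ")
    text = text.strip()
    on_source = text.startswith(">")   # '>' = test the page source
    if on_source:
        text = text[1:]
    negate = text.startswith("!")      # '!' = the text must be absent
    if negate:
        text = text[1:]
    is_regex = text.startswith("*")    # '*' = rest is a regular expression
    if is_regex:
        text = text[1:]
    return Check(url, text, negate, on_source, is_regex, comment.strip())

def check_passes(check, page_text, page_source):
    """Return True if the page still satisfies the check."""
    haystack = page_source if check.on_source else page_text
    if check.is_regex:
        found = re.search(check.text, haystack) is not None
    else:
        found = check.text in haystack
    return found != check.negate
```

In this sketch a failed check (check_passes returning False) is what would be reported as a change.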
If you need to make more than one test on the same page, simply add multiple lines with the same URL. A shortcut for this is to specify also: on the second and subsequent lines, in place of the repeated URL. WebCheck performs all the tests for a page in the same fetch operation; the fetch itself is not duplicated for each test.
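One way to picture the "one fetch, many tests" behaviour is to group the lines by URL before fetching anything. The sketch below is an assumption about how that could be done, not WebCheck's own code; group_checks is a hypothetical name, and treating lines that begin with # as whole-line comments is also an assumption.

```python
from collections import OrderedDict

def group_checks(lines):
    """Map each URL to the list of texts to test, so each URL is fetched once."""
    groups = OrderedDict()
    last_url = None
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or (assumed) whole-line comment
        first, _, text = line.partition(" ")
        if first == "also:":
            if last_url is None:
                raise ValueError("'also:' used before any URL")
            url = last_url              # reuse the URL of the previous line
        else:
            url = last_url = first
        groups.setdefault(url, []).append(text.strip())
    return groups
```

Each key of the result would then be fetched once and every text in its list tested against that single response.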
It is possible to add arbitrary HTTP headers (such as Accept-Language: en) on lines of their own; these apply to all subsequently-listed URLs (except when using a Javascript processor, see below) until removed by setting them blank (e.g. Accept-Language:). One use of arbitrary headers is to send "cookies" to indicate you've accepted the GDPR or whatever: in most graphical browsers' Developer Options you can go to a Javascript console and type document.cookie to find out what to put in the Cookie: header to restore your current "session" with the server.
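The "applies to everything below until blanked" rule can be modelled as a small piece of running state. The following is a rough sketch under stated assumptions: headers_per_url and HEADER_LINE are hypothetical names, a header line is recognised by a Name: prefix that is not followed by //, and the sketch handles only header lines and URL lines (not also:, :include or the other directives).

```python
import re

# Assumption for this sketch: a header line looks like "Name: value" or "Name:",
# and is told apart from a URL line by having no "//" after the colon.
HEADER_LINE = re.compile(r"^([A-Za-z][A-Za-z0-9-]*):(?!//)\s*(.*)$")

def headers_per_url(lines):
    """Yield (url, headers) pairs, applying header lines to the URLs after them."""
    current = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        match = HEADER_LINE.match(line)
        if match:
            name, value = match.groups()
            if value:
                current[name] = value      # set or replace the header
            else:
                current.pop(name, None)    # a blank value removes it
            continue
        url = line.split(" ", 1)[0]
        yield url, dict(current)           # copy, so later changes don't leak back
```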
It is also possible to add :include directives if you wish to place some of your configuration into other files, e.g. :include wiki-pages.list (and if any file such as webcheck.list is a directory then the files inside it are read).
You can follow new items on RSS/Atom feeds: give the feed URL and *no* search text.
If the site lists new items but does not support RSS, you can also *extract* items, by setting the search text to {START...END} where START and END are starting and ending strings that surround each item. (By default this is done on the parsed version of the page; to do it on the HTML source, add a > before the { at the start of the search text.)
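To make the {START...END} idea concrete, here is a minimal sketch of item extraction. It is not WebCheck's implementation: extract_items and new_items are hypothetical names, START and END are treated as literal strings, and the sketch returns only the text between the delimiters (whether WebCheck itself keeps the delimiters is not stated here).

```python
import re

def extract_items(spec, page):
    """Return every substring of `page` that lies between START and END.

    `spec` has the form "{START...END}" as described above.
    """
    if not (spec.startswith("{") and spec.endswith("}") and "..." in spec):
        raise ValueError("expected {START...END}")
    start, _, end = spec[1:-1].partition("...")
    # Non-greedy match so each item stops at the first END after its START;
    # DOTALL lets an item span several lines.
    pattern = re.escape(start) + "(.*?)" + re.escape(end)
    return [item.strip() for item in re.findall(pattern, page, flags=re.DOTALL)]

def new_items(spec, page, seen):
    """Report items not yet in the `seen` set, and remember them for next time."""
    fresh = [item for item in extract_items(spec, page) if item not in seen]
    seen.update(fresh)
    return fresh
```

Calling new_items on each run with a persistent seen set would report an item only the first time it appears, which is the behaviour you would want for a page that lists news without offering a feed.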
Besides checking http://, https:// and gemini:// URLs, you can check for:
If the text you wish to check is written by complex Javascript and there's no simple way to get it out of the site's source code, and/or if you need to "log in" or perform other interaction to make it available, then you could try installing one of:
and have WebCheck drive one of these.
Edbrowse is more lightweight and should be enough in many cases, but the others have more complete DOM support (see discussion on Edbrowse issue 4), and some sites will work *only* in Headless Chrome. In any case you'd be advised to set the check-frequency wisely (see Efficiency section below).
For Edbrowse, prepend e:// to the URL, e.g.:
e://http://javascripty-site.example.com my comment
Note that checks on the "source" of a rendered DOM (such as checks for class names written by Javascript) are *not* available when using Edbrowse: you'll have to run Headless Chrome or PhantomJS for those.
Advanced users of edbrowse can write scripts to perform simple interaction with a Javascript site before reading out the text, provided such interaction does not involve spaces, for example:
e://http://javascripty.example.org\/{LOG/\g\/<>/\i=my-username\/<>/\i=myPassword\/<Log/\i*\/{INBOX/\g No messages
Here, /{LOG/ searches for a link whose text begins with "LOG", g follows the first link on the current line, /<>/ searches for empty form fields, i= fills them in and i* submits; see the edbrowse manual for a full list. \ is used to separate commands; an implicit b (browse) command is added before the start and "print all" at the end. Source is not shown.
For PhantomJS or Headless Chrome, you need to install the "webdriver" (Selenium) interface. If you need to set it up in your home directory, try pip install selenium --root $HOME/whatever, set PYTHONPATH appropriately, and put the phantomjs or chromedriver binary in your PATH before running webcheck.
An instruction to fetch data via Headless Chrome or PhantomJS looks like this:
{ http://site.example.org/ [Click here to show the login form] #txtUsername=me@example.com [#okButton] [Show results] "Results" }
where the first word is the starting URL, and items in square brackets will click either a link with that exact text or an element with the id or name specified after a # (check for id= or name= in a browser's Document Inspector or similar), or the first element with the class specified after a . dot (you can specify other elements of a class someClass via .someClass#2 and .someClass#3 etc). #id=text sends keystrokes text to an input field with ID (or name) id (.class=text is also possible), and you can include spaces by adding a quoted phrase after the =. Text in quotes on its own causes the browser to wait until the page source contains it (which is usually necessary when using Headless Chrome or PhantomJS, less so with edbrowse). Also available is #id->text to select from a drop-down (by visible text; blank means deselect all; add quotes after the -> to select a multi-word phrase), and #id*n to set a checkbox to state n (0 or 1).
Some sites make you click each item on a results page to reveal an individual result. To automate this in Headless Chrome or PhantomJS, use /start/5 where "start" is the beginning of each item ID and 5 is the number of seconds to wait after clicking. A snapshot of the page after each click will be added to that of the final page, and the checks (or item extractions) that you specify will occur on the combined result. It's assumed that no "back" button needs to be pressed between clicks.
To be as efficient as is reasonable for this kind of program, WebCheck has the following features:
However, connection re-use and last-modified handling are *not* performed when using edbrowse or webdriver (except within each session of course).
You can also change the frequency of specific checks with the days command, which must appear on a line of its own, for example:
days 5
which specifies that the addresses below that line will be checked only if the day they were previously checked was at least 5 days ago (unless they are also listed in sections that require more frequent checks). For convenience, daily, weekly and monthly are short for days 1, days 7 and days 30 respectively. If, for testing, you need to temporarily turn off all frequency limits and the Last-Modified and ETag checks (but not the record of already-seen RSS items), you can specify --test-all on the WebCheck command line.
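The frequency rule amounts to a simple date comparison. The sketch below shows one way to express it; parse_frequency, is_due and ALIASES are hypothetical names, and how WebCheck actually stores the date of the previous check is not covered here.

```python
from datetime import date

# Shorthand names and the day counts given above.
ALIASES = {"daily": 1, "weekly": 7, "monthly": 30}

def parse_frequency(line):
    """Return the interval in days for a frequency line, or None if it isn't one."""
    words = line.split()
    if len(words) == 1 and words[0] in ALIASES:
        return ALIASES[words[0]]
    if len(words) == 2 and words[0] == "days" and words[1].isdigit():
        return int(words[1])
    return None

def is_due(last_checked, interval_days, today, test_all=False):
    """Decide whether a URL should be fetched on this run."""
    if test_all or last_checked is None:
        return True   # --test-all given, or the URL was never checked before
    return (today - last_checked).days >= interval_days
```

If a URL appears under several frequency sections, taking the smallest of their intervals before calling is_due would match the "more frequent checks win" rule described above.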
webcheck.py (requires Python; compatible with both Python 2 and Python 3).
License: Apache 2.
All material © Silas S. Brown unless otherwise stated. Apache is a registered trademark of The Apache Software Foundation. Javascript is a trademark of Oracle Corporation in the US. Python is a trademark of the Python Software Foundation. Any other trademarks I mentioned without realising are trademarks of their respective holders.