💾 Archived View for jb55.com › ward.asia.wiki.org › selective-scrape-pages captured on 2022-01-08 at 14:48:14. Gemini links have been rewritten to link to archived content
We add restrictions to Scrape Pages so that it runs faster and finds more relevant content. We've added JSON output for downstream visualization and an example Node app that reads it. github
Click scrape.html to run with defaults.
Click scrape.svg to see a download drawn with scrape.js.
We accept parameters, each of which has a default. Click the asset scrape.html to see a scrape run with defaults, then edit the new tab's URL to override them.
Limit the graph to include only pages edited in the last 10 days. Default is 30 days. A fork alone doesn't count as an edit. example
Start the scrape at the specified site. The default is found.ward.bay.wiki.org. The scrape discovers more when pages fork or otherwise reference new sites. example
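The two parameters above could be read from the page URL roughly as follows. This is a minimal sketch, not the code in scrape.html: the query names `days` and `site` are hypothetical (the page does not show the real ones), and the journal shape (`type` plus a `date` in milliseconds) is an assumption about how edits are recorded.

```javascript
// Defaults taken from the text above; the parameter NAMES are hypothetical.
const DEFAULTS = { days: 30, site: 'found.ward.bay.wiki.org' };

// Read parameters from a URL, falling back to the default for each one.
function parseParams(url) {
  const query = new URL(url).searchParams;
  const days = Number(query.get('days')); // Number(null) is 0, so a missing value falls through
  return {
    days: Number.isFinite(days) && days > 0 ? days : DEFAULTS.days,
    site: query.get('site') || DEFAULTS.site,
  };
}

// Most recent edit date in a journal, ignoring fork actions,
// since a fork alone doesn't count as an edit.
// Assumes entries shaped like { type: 'edit', date: 1641650000000 }.
function lastEdit(journal) {
  const dates = journal
    .filter((action) => action.type !== 'fork' && typeof action.date === 'number')
    .map((action) => action.date);
  return dates.length ? Math.max(...dates) : null;
}

// True when the page was edited within the last `days` days.
function isRecent(journal, days, now = Date.now()) {
  const edited = lastEdit(journal);
  const msPerDay = 24 * 60 * 60 * 1000;
  return edited !== null && now - edited <= days * msPerDay;
}
```

With this sketch, a URL ending in `scrape.html?days=10` would narrow the graph to ten days while still starting at the default site.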
We construct page objects within site objects as we scrape. We have a lot of latitude for what and when we record. A site may be present with no pages selected for inclusion.
Pages are recorded as objects with the fields shown. A slug is a lower-case hyphenated version of the title. Links is a list of the links found on the page. Forks is a list of additional sites where links may resolve, in the order they should be checked.
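The site and page records described above might look something like this. It is a sketch under assumptions: the nesting (`sites[site].pages[slug]`), the helper names, and the choice to check a page's own site before its forks are all hypothetical, and the slug rule is one plausible reading of "lower-case hyphenated".

```javascript
// One plausible slug rule: spaces become hyphens, other punctuation is dropped.
function asSlug(title) {
  return title.replace(/\s/g, '-').replace(/[^A-Za-z0-9-]/g, '').toLowerCase();
}

// Make sure a site object exists. A site may be present with no pages selected.
function recordSite(sites, site) {
  if (!sites[site]) sites[site] = { pages: {} };
  return sites[site];
}

// Record a page object within its site object.
function recordPage(sites, site, title, links, forks) {
  const slug = asSlug(title);
  const page = { title, slug, links, forks };
  recordSite(sites, site).pages[slug] = page;
  return page;
}

// Sites to try when resolving a link: the page's own site first (an assumption),
// then the fork sites in the order they were recorded.
function resolutionOrder(site, page) {
  return [site, ...page.forks.filter((other) => other !== site)];
}
```

For example, `asSlug('Selective Scrape Pages')` gives `selective-scrape-pages`, which matches the slug in this page's own address.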
We now properly punctuate and download the scrape results as a JSON file once the scrape has completed. The file name is generated from scrape parameters.
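As an illustration of that last step, the sketch below serializes the results in one call once the scrape is done, which keeps the punctuation valid, and derives a file name from the parameters. The naming scheme and the parameter names `site` and `days` are hypothetical; the page only says the name comes from the scrape parameters.

```javascript
// Hypothetical naming scheme built from hypothetical parameter names.
function downloadName(params) {
  const site = params.site.replace(/[^A-Za-z0-9.-]/g, '_'); // keep the name file-system safe
  return `scrape-${site}-${params.days}d.json`;
}

// Serialize the whole result at once so commas and brackets are always well formed.
function serialize(sites) {
  return JSON.stringify(sites, null, 2);
}
```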
As an example we've created a Node application, scrape.js, that can read a download file and render it as SVG using Graphviz dot notation. We connect sites with two different kinds of lines.
<img width=100% src=http://found.ward.bay.wiki.org/assets/pages/selective-scrape-pages/scrape.svg>
A solid line means a link followed by a click.
A dashed line means a twin forked from another site.
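The two line styles map directly onto Graphviz edge attributes. The following is a minimal sketch of how such dot text could be generated, not the actual scrape.js: the edge list shape (`from`, `to`, and a `kind` of `'link'` or `'fork'`) is a hypothetical intermediate form, not the format of the download file.

```javascript
// Build Graphviz dot text from a list of site-to-site edges.
// Assumed edge shape: { from: 'a.wiki.org', to: 'b.wiki.org', kind: 'link' | 'fork' }.
function toDot(edges) {
  const lines = ['digraph scrape {', '  node [shape=box];'];
  const seen = new Set();
  for (const { from, to, kind } of edges) {
    const key = `${from}|${to}|${kind}`;
    if (seen.has(key)) continue; // draw each connection once
    seen.add(key);
    // solid: a link followed by a click; dashed: a twin forked from another site
    const style = kind === 'fork' ? 'dashed' : 'solid';
    lines.push(`  ${JSON.stringify(from)} -> ${JSON.stringify(to)} [style=${style}];`);
  }
  lines.push('}');
  return lines.join('\n');
}
```

Saved to a file such as scrape.dot, the result can be rendered with `dot -Tsvg scrape.dot > scrape.svg`.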