💾 Archived View for jb55.com › ward.asia.wiki.org › selective-scrape-pages captured on 2022-01-08 at 14:48:14. Gemini links have been rewritten to link to archived content


-=-=-=-=-=-=-

Selective Scrape Pages

We add restrictions to Scrape Pages so that it runs faster and finds more relevant content. We've also added JSON output for downstream visualization and an example Node application that reads it.

Scrape Pages

github

Click scrape.html to run with defaults.

Click scrape.svg to see a download drawn with scrape.js.

Parameters

We accept several parameters, each of which has a default. Click the asset scrape.html to see a scrape run with the defaults. Edit the new tab's URL to override them.

Limit the graph to pages edited within a given number of days, for example the last 10 days. The default is 30 days. A fork alone doesn't count as an edit.

Start the scrape at the specified site. The default is found.ward.bay.wiki.org. The scrape discovers more sites when pages fork from or otherwise reference them.
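The way defaults and URL overrides might fit together can be sketched as follows. This is a minimal sketch, not the actual scrape.html code: the parameter names `days` and `site`, the helper names, and the example.com URL in the usage note are all assumptions, since the page does not spell them out. Only the default values (30 days, found.ward.bay.wiki.org) come from the text above.

```javascript
// Defaults taken from the text; the keys `days` and `site` are hypothetical names.
const DEFAULTS = { days: 30, site: 'found.ward.bay.wiki.org' };

// Read overrides from a page URL's query string, falling back to the defaults.
function readParams(url) {
  const query = new URL(url).searchParams;
  const days = Number.parseInt(query.get('days') ?? '', 10);
  return {
    days: Number.isInteger(days) && days > 0 ? days : DEFAULTS.days,
    site: query.get('site') || DEFAULTS.site,
  };
}

// A page is kept when its most recent edit falls inside the window.
// Timestamps are milliseconds since the epoch.
function editedWithin(lastEditMs, days, nowMs = Date.now()) {
  return nowMs - lastEditMs <= days * 24 * 60 * 60 * 1000;
}
```

Under these assumptions, `readParams('http://example.com/scrape.html?days=10')` would keep the default site and narrow the window to 10 days.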

Download

We construct page objects within site objects as we scrape. We have a lot of latitude for what and when we record. A site may be present with no pages selected for inclusion.

Pages are recorded as objects with the fields shown. A slug is a lower-case, hyphenated version of the title. Links is a list of the links found on the page. Forks is a list of additional sites where links may resolve, in the order they should be checked.
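A rough sketch of this bookkeeping follows. The field names `title`, `slug`, `links`, and `forks`, the `pages` container, and the function names are assumptions made for illustration; the real scraper may record more or differently. The slug rule shown is one plausible reading of "lower-case hyphenated": whitespace becomes hyphens and other punctuation is dropped.

```javascript
// Turn a page title into a slug: whitespace to hyphens, strip other
// punctuation, then lower-case.
function asSlug(title) {
  return title.replace(/\s/g, '-').replace(/[^A-Za-z0-9-]/g, '').toLowerCase();
}

// Make sure a site object exists, even if no page is ever selected for it.
function recordSite(sites, site) {
  if (!sites[site]) sites[site] = { pages: {} };
  return sites[site];
}

// Record a page object inside its site object (hypothetical field names).
function recordPage(sites, site, title, links, forks) {
  const page = { title, slug: asSlug(title), links, forks };
  recordSite(sites, site).pages[page.slug] = page;
  return page;
}
```

For instance, `asSlug('Selective Scrape Pages')` gives `selective-scrape-pages`, which matches the slug in this page's own address.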

We now properly punctuate and download the scrape results as a JSON file once the scrape has completed. The file name is generated from the scrape parameters.
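One way to produce such a file is sketched below. The naming scheme and both function names are hypothetical, since the page does not say how the name is built. Serializing the whole structure in one call is an assumption too, chosen because it leaves the punctuation to the JSON encoder.

```javascript
// Build a file name from the scrape parameters (hypothetical scheme).
function downloadName({ site, days }) {
  return `scrape-${site}-${days}.json`;
}

// Serialize the collected sites once the scrape has finished.
// JSON.stringify takes care of commas, quotes, and brackets.
function serialize(sites) {
  return JSON.stringify(sites, null, 2);
}
```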

Example

As an example, we've created a Node application, scrape.js, that reads a download file and renders it as SVG using Graphviz dot notation. We connect sites with two different kinds of lines.

<img width=100% src=http://found.ward.bay.wiki.org/assets/pages/selective-scrape-pages/scrape.svg>

A solid line means a link followed by a click.

A dashed line means a twin forked from another site.
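The two line styles map directly onto Graphviz edge attributes. The sketch below is not the actual scrape.js. It assumes the download has already been reduced to a list of site-to-site edges, each with a hypothetical `kind` of either `'link'` or `'fork'`, and it only produces the dot text.

```javascript
// Emit Graphviz dot text for a list of { from, to, kind } edges.
// kind 'fork' draws a dashed line; anything else draws a solid line.
function toDot(edges) {
  const lines = ['digraph scrape {'];
  const seen = new Set();
  for (const { from, to, kind } of edges) {
    const key = `${from}|${to}|${kind}`;
    if (seen.has(key)) continue; // draw each distinct edge once
    seen.add(key);
    const style = kind === 'fork' ? 'dashed' : 'solid';
    // Site names contain dots, so they are quoted as dot string IDs.
    lines.push(`  ${JSON.stringify(from)} -> ${JSON.stringify(to)} [style=${style}];`);
  }
  lines.push('}');
  return lines.join('\n');
}
```

The resulting text can be handed to Graphviz, for example with `dot -Tsvg`, to obtain an SVG drawing.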