💾 Archived View for jb55.com › ward.asia.wiki.org › stepping-the-async-scrape captured on 2022-01-08 at 14:03:28. Gemini links have been rewritten to link to archived content

-=-=-=-=-=-=-

Stepping the Async Scrape

We've designed a scrape that runs with some parallelism, expressed through liberal use of async and await. Now we think we can single-step this long-running computation, but we first have to work out what that even means.

See Search over the Horizon

We will keep separate queues for the sites and the slugs to be examined. We will preload the site queue with a few broadly connected sites. Visiting sites will produce slugs; visiting slugs will produce sites.
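A minimal sketch of the two queues, in TypeScript. The WorkQueue class, its method names, and the seed domains are hypothetical. Skipping items that have already been seen is an assumption, added so that sites and slugs feeding each other cannot loop forever.

```typescript
// A work queue that refuses items it has already seen (an assumption,
// so that sites producing slugs producing sites cannot cycle endlessly).
class WorkQueue {
  private pending: string[] = [];
  private seen = new Set<string>();

  // Returns true if the item was new and has been queued.
  add(item: string): boolean {
    if (this.seen.has(item)) return false;
    this.seen.add(item);
    this.pending.push(item);
    return true;
  }

  // Removes and returns the oldest queued item, or undefined when empty.
  next(): string | undefined {
    return this.pending.shift();
  }

  get size(): number {
    return this.pending.length;
  }
}

const sites = new WorkQueue();
const slugs = new WorkQueue();

// Hypothetical seed sites; a real scrape would preload broadly connected ones.
for (const seed of ["seed-one.example", "seed-two.example"]) {
  sites.add(seed);
}
```

The boolean returned by add is one way to tell new items from familiar ones when reporting.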

Aside: a slug is a page title in lower case with spaces turned to hyphens and other punctuation removed.
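The definition in the aside could be sketched like this. The function name asSlug and the exact regular expressions are assumptions for illustration, and keeping only ASCII letters and digits is one reading of "other punctuation removed"; the wiki's own implementation may treat edge cases differently.

```typescript
// Turn a page title into a slug: spaces become hyphens, anything that is
// not an ASCII letter, digit or hyphen is dropped, and the result is lowercased.
function asSlug(title: string): string {
  return title
    .replace(/\s/g, "-")
    .replace(/[^A-Za-z0-9-]/g, "")
    .toLowerCase();
}

console.log(asSlug("Search over the Horizon")); // "search-over-the-horizon"
```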

We'll single-step through dosite, pausing after each sitemap fetch to report availability, activity and errors.

We'll single-step through doslug, pausing after each page fetch to report new and familiar sites and page format errors.
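One way to get this pause-and-report behavior is an async generator: it runs up to a yield, hands back a report, and stays suspended until the caller asks for the next step. The sketch below covers only dosite and makes several assumptions: the SiteReport shape, the SitemapFetcher type, and passing the fetcher in as a parameter are all hypothetical, chosen so the stepping logic can be tried without a network. It is not necessarily how the actual scrape is written.

```typescript
// What one step reports back (hypothetical shape).
type SiteReport = {
  site: string;
  available: boolean;
  slugs: string[];
  error?: string;
};

// Hypothetical fetcher: resolves to the slugs listed in a site's sitemap,
// or rejects when the site cannot be reached.
type SitemapFetcher = (site: string) => Promise<string[]>;

// Each call to next() on the returned generator performs exactly one
// sitemap fetch, then pauses at the yield until next() is called again.
async function* dosite(
  sites: string[],
  fetchSitemap: SitemapFetcher,
): AsyncGenerator<SiteReport> {
  while (sites.length > 0) {
    const site = sites.shift()!;
    try {
      const slugs = await fetchSitemap(site);
      yield { site, available: true, slugs };
    } catch (err) {
      const error = err instanceof Error ? err.message : String(err);
      yield { site, available: false, slugs: [], error };
    }
  }
}
```

A caller would create the generator once with dosite(queue, fetcher) and then await run.next() each time a step is requested. A doslug generator could follow the same pattern, yielding after each page fetch.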

It makes sense to run either dosite or doslug against its own queue independently. Both must run to complete a scrape.

A scrape will launch with a few seed sites and complete when both queues are empty and no work is in flight. A cron job can be configured to launch scrapes on a regular schedule.
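The completion test has three parts, and the in-flight part is easy to forget: a fetch that is still running may yet add to either queue. A minimal sketch, assuming a hypothetical ScrapeState class that counts running work:

```typescript
// Tracks the three conditions for a finished scrape:
// no queued sites, no queued slugs, and no fetch still running.
class ScrapeState {
  sites: string[] = [];
  slugs: string[] = [];
  inFlight = 0;

  // Run one unit of work while counting it as in flight.
  // The counter is decremented even if the work throws.
  async track<T>(work: () => Promise<T>): Promise<T> {
    this.inFlight++;
    try {
      return await work();
    } finally {
      this.inFlight--;
    }
  }

  get complete(): boolean {
    return (
      this.sites.length === 0 &&
      this.slugs.length === 0 &&
      this.inFlight === 0
    );
  }
}
```

Wrapping every fetch in track keeps the counter honest, so checking complete between steps will not end the scrape while a late response could still enqueue more work.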

See How To Deno