💾 Archived View for jb55.com › ward.asia.wiki.org › bbc-world-service captured on 2022-01-08 at 14:24:36. Gemini links have been rewritten to link to archived content
-=-=-=-=-=-=-
Scraper for news stories from the BBC.
This is a mockup of an idea for improving drag-and-drop content creation. See Interesting Places for hacking wiki.
A variation would be to 'scent' a web search with sites known to parse well, in addition to the search terms.
<h3> Routing
The scraper would be trained to recognize URLs from samples that show their similarities and differences. The samples may need to be marked up somehow to simplify recognition.
Routes will be handled by a server-side plugin that aggregates the routes found within a site at startup, together with any routing found on remote pages in the lineup.
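One way to picture recognition from samples is the minimal sketch below. It assumes the samples share a host and path depth, keeps path segments that are identical across samples as literals, and turns segments that differ into wildcards. The function name `infer_route` and the example.org URLs are hypothetical and used only for illustration.

```python
import re
from urllib.parse import urlparse

def infer_route(samples):
    """Build a regex from sample URLs (hypothetical helper).

    Path segments shared by every sample stay literal;
    segments that differ between samples become wildcards.
    """
    parsed = [urlparse(u) for u in samples]
    hosts = {p.netloc for p in parsed}
    paths = [p.path.strip("/").split("/") for p in parsed]
    if len(hosts) != 1 or len({len(p) for p in paths}) != 1:
        raise ValueError("samples must share a host and path depth")
    parts = []
    for column in zip(*paths):  # compare the same segment position across samples
        if len(set(column)) == 1:
            parts.append(re.escape(column[0]))  # similarity: keep as a literal
        else:
            parts.append("[^/]+")               # difference: match any one segment
    host = re.escape(hosts.pop())
    return re.compile(r"^https?://" + host + "/" + "/".join(parts) + "/?$")

# Made-up sample URLs, not real story addresses.
samples = [
    "https://example.org/news/world-12345678",
    "https://example.org/news/business-87654321",
]
route = infer_route(samples)
```

With these samples, `route` matches other pages under `/news/` on the same host and rejects pages under a different first segment. Marked-up samples could replace the position-by-position comparison with explicit hints about which parts vary.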
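The aggregation step might look something like the sketch below. The route record shape (`pattern`, `detectors`), the `routes` key on a lineup page, and the first-one-wins policy for duplicate patterns are all assumptions made for illustration; the notes above do not specify them.

```python
def aggregate_routes(site_routes, lineup_pages):
    """Merge a site's own routes with routes carried by lineup pages.

    Site routes are listed first. A remote route is skipped when its
    pattern is already known (first one wins, an assumed policy).
    """
    local = [dict(r, origin="site") for r in site_routes]
    remote = [
        dict(r, origin=page["site"])
        for page in lineup_pages
        for r in page.get("routes", [])  # pages without routing contribute nothing
    ]
    table, seen = [], set()
    for route in local + remote:
        if route["pattern"] not in seen:
            seen.add(route["pattern"])
            table.append(route)
    return table

# Hypothetical data: one local route, two lineup pages from a remote site.
site_routes = [{"pattern": "example.org/news/*", "detectors": ["headline"]}]
lineup_pages = [
    {"site": "routes.example.net", "routes": [
        {"pattern": "example.org/news/*", "detectors": ["other"]},
        {"pattern": "example.org/sport/*", "detectors": ["headline"]},
    ]},
    {"site": "routes.example.net"},
]
table = aggregate_routes(site_routes, lineup_pages)
```

Recording an `origin` on each route keeps track of where it came from, which could later feed the provenance of generated pages.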
There could be whole sites devoted to collecting and applying routes.
<h3> Parsing
We'll assume sites use modern HTML with reasonable div tags and class names.
We'll organize parsing around detectors that construct specific output elements.
Detector specification will require some familiarity with HTML/CSS and browser debugging tools.
The server will be required to proxy sites that do not send CORS headers.
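As a rough illustration of a detector, the sketch below treats each one as a tag and class name plus the kind of output element to construct. It uses only Python's standard `html.parser`; the detector fields (`tag`, `class`, `emit`), the class names in the sample markup, and the output item shape are assumptions for this example, not a specified format.

```python
from html.parser import HTMLParser

class ClassDetector(HTMLParser):
    """Collect the text of each element with the given tag and class name."""
    def __init__(self, tag, class_name):
        super().__init__()
        self.tag, self.class_name = tag, class_name
        self.depth = 0      # greater than 0 while inside a matching element
        self.chunks = []
        self.results = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == self.tag:
                self.depth += 1  # nested element with the same tag name
        elif tag == self.tag:
            classes = (dict(attrs).get("class") or "").split()
            if self.class_name in classes:
                self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1
            if self.depth == 0:
                # Join the collected text and normalize whitespace.
                self.results.append(" ".join("".join(self.chunks).split()))
                self.chunks = []

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)

def detect(html, detectors):
    """Apply each detector in turn and emit one output item per match."""
    items = []
    for d in detectors:
        parser = ClassDetector(d["tag"], d["class"])
        parser.feed(html)
        parser.close()
        items += [{"type": d["emit"], "text": t} for t in parser.results if t]
    return items

# Hypothetical markup and detectors.
SAMPLE = """
<div class="story">
  <h1 class="story-headline">Example headline</h1>
  <p class="story-body">First paragraph.</p>
  <p class="story-body">Second <b>paragraph</b>.</p>
</div>
"""
DETECTORS = [
    {"tag": "h1", "class": "story-headline", "emit": "title"},
    {"tag": "p", "class": "story-body", "emit": "paragraph"},
]
items = detect(SAMPLE, DETECTORS)
```

Items come out grouped by detector rather than in document order, which is a simplification of this sketch. Because a detector here is plain data, the same specification could be applied on the server or passed up to the client.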
The server might apply detectors or pass them up to the client to be applied there.
Generated pages should cite the source and the route page used to scrape them as provenance in the create action.
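A create action carrying this provenance might be shaped roughly as follows. This is a sketch under assumptions: the `provenance` field and its `source` and `route` keys are invented for illustration, as are the helper name and the example values.

```python
import time

def create_action(title, story, source_url, route_page):
    """Build a create action that records where the page came from.

    The "provenance" field is an assumed shape, not a confirmed format.
    """
    return {
        "type": "create",
        "date": int(time.time() * 1000),  # milliseconds since the epoch
        "item": {"title": title, "story": story},
        "provenance": {
            "source": source_url,  # the page that was scraped
            "route": route_page,   # the route page whose detectors were applied
        },
    }

# Hypothetical values.
action = create_action(
    "Example headline",
    [{"type": "paragraph", "text": "First paragraph."}],
    "https://example.org/news/world-12345678",
    "routes.example.net/example-news-routes",
)
```

Keeping both references in the first journal entry would let a reader trace a generated page back to its source and re-run the same route if the scrape needs repeating.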