We're exploring how a more data centric federated wiki server might work. We'd like this to be driven by technological innovation and by improving applications we already have.
The tech revolves around ES6 modules, which is pure geek. The application I'm looking into is federation search. See How Scrape Works.
We've rewritten this in animated JavaScript before. Let's make this even easier with a deno wiki page.
I'm halfway through rewriting this in JS async/await style, with a queue and a clock that starts another scrape operation every second. This is the geek part. I'm also thinking that this should play well with the new federation search in a way that grows gracefully with tens of thousands of sites.
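A minimal sketch of that queue-and-clock shape. The JSON endpoint is federated wiki's usual one, but scrapePage and the queue contents are made-up names, not the real code.

```js
// Sketch: a work queue drained by a clock, one scrape per second.
const queue = []                 // {site, slug} pairs waiting to be scraped

async function scrapePage ({site, slug}) {
  try {
    const resp = await fetch(`http://${site}/${slug}.json`)
    const page = await resp.json()
    // ... find sites mentioned by fork or reference, queue new work ...
  } catch (err) {
    console.error(`${site}/${slug} ${err.message}`)
  }
}

const clock = setInterval(() => {
  if (!queue.length) return clearInterval(clock)
  scrapePage(queue.shift())      // not awaited, so slow requests overlap
}, 1000)
```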
But rather than indexing each site as I go, I think I should leave that to the recently implemented full text search. Better to use federation indexing to build an over-the-horizon model of what we now think of as Rosters.
Say you have a Roster of a dozen sites and it isn't turning up what you are looking for. So you say, federation, what's out beyond that? Maybe we then go to the federation map and compose another 100 sites. This is still a small portion of the federation, but it is 10x as likely to have what you are looking for. Not good enough? Try 100x.
So I am thinking the next generation scrape will just build a who-cites-whom directed graph of the federation as it exists right now. This is actually simpler than what we've done to date and will probably work much better too.
See Graphing the Horizon
See Stepping the Async Scrape
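One way to picture the graph and the roster expansion above, with assumed names; an adjacency map of site to the sites it mentions.

```js
// Sketch: who-cites-whom as an adjacency map, site => Set of cited sites.
const graph = new Map()

function cite (from, to) {
  if (!graph.has(from)) graph.set(from, new Set())
  graph.get(from).add(to)
}

// One step over the horizon: every site the roster cites
// that isn't already in the roster.
function horizon (roster) {
  const beyond = new Set()
  for (const site of roster)
    for (const cited of graph.get(site) ?? [])
      if (!roster.has(cited)) beyond.add(cited)
  return beyond
}
```

Running horizon again on the union of roster and beyond steps out another ring. Two or three steps is the 10x then 100x growth imagined above.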
We launch a scrape with a list of root sites as arguments. This builds a flat-file index, organized by site and page, listing the sites mentioned on each page.
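The on-disk shape might look like this. The directory names are an assumption; the text above only promises one flat file per page listing the sites mentioned there.

```
data/
  ward.asia.wiki.org/
    search-over-the-horizon     one cited site per line
    welcome-visitors
  another.example.wiki/
    welcome-visitors
```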
This emits one network request every second until the scrape is complete. File modification dates are adjusted to reflect sitemap dates and are used to detect when changes require a new scrape of a page.
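A sketch of that date bookkeeping with Deno's file APIs, assuming the data/ layout above and sitemap entries that carry a slug and a date.

```js
// Sketch: re-scrape only when the sitemap says the page is newer
// than our stored index file.
async function needsScrape (site, slug, sitemapDate) {
  try {
    const {mtime} = await Deno.stat(`data/${site}/${slug}`)
    return !mtime || mtime < new Date(sitemapDate)
  } catch {
    return true                                      // nothing on disk yet
  }
}

async function saveIndex (site, slug, sites, sitemapDate) {
  await Deno.mkdir(`data/${site}`, {recursive: true})
  await Deno.writeTextFile(`data/${site}/${slug}`, sites.join('\n'))
  const date = new Date(sitemapDate)
  await Deno.utime(`data/${site}/${slug}`, date, date)  // mtime = sitemap date
}
```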
We should wait until the last write completes before exiting, but otherwise allow network operations to overlap when they are slow.
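One way to get both behaviors is to track in-flight promises in a Set and drain it at exit. The names here are assumptions; scrapePage catches its own errors, so these promises always settle.

```js
// Sketch: let slow operations overlap, but drain them before exit.
const inflight = new Set()

function track (promise) {
  inflight.add(promise)
  promise.finally(() => inflight.delete(promise))
  return promise
}

// in the clock: track(scrapePage(queue.shift()))
// at exit, loop in case new work arrived while waiting:
while (inflight.size) await Promise.allSettled([...inflight])
```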
We count unique errors reported while scraping.
We count unique slugs over all sites.
We count unique sites named by fork or reference.
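Sets make all three tallies one-liners, since duplicates count once. A sketch:

```js
const errors = new Set()   // unique errors reported while scraping
const slugs = new Set()    // unique slugs over all sites
const sites = new Set()    // unique sites named by fork or reference

// during the scrape: errors.add(message), slugs.add(slug), sites.add(site)
console.log(`${errors.size} errors, ${slugs.size} slugs, ${sites.size} sites`)
```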
See Over the Horizon Later for overnight stats.
A handy command for finding scrape data files modified within the last week, assuming the index lives under data/:
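```sh
# files under data/ touched in the past 7 days
find data -type f -mtime -7
```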