
Search over the Horizon

We're exploring how a more data-centric federated wiki server might work. We'd like this to be driven by technological innovation and by improving applications we already have.

The tech revolves around ES6 modules, which is pure geek. The application I'm looking into is federation search. See How Scrape Works.

How Scrape Works

We've rewritten this in animated JavaScript before. Let's make this even easier with deno wiki. page

page

I'm halfway through rewriting this in JS async/await style, with a queue and a clock that starts another scrape operation every second. This is the geek part. I'm also thinking that this should play with the new federation search in a way that grows gracefully with tens of thousands of sites.
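Here is a minimal sketch of that queue-and-clock shape. The scrapePage function and the seed entry are placeholders of my own, not the actual implementation:

```
// A work queue drained by a one-second clock.
const queue = ['ward.asia.wiki.org/welcome-visitors']

async function scrapePage (todo) {
  const [site, slug] = todo.split('/')
  const page = await (await fetch(`http://${site}/${slug}.json`)).json()
  // parse page.story here, push newly discovered site/slug pairs onto queue
}

const clock = setInterval(() => {
  if (queue.length === 0) return clearInterval(clock)
  scrapePage(queue.shift()) // not awaited, so slow requests may overlap
}, 1000)
```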

But rather than indexing each site as I go, I think I should leave that to the full-text search recently implemented. Better to use federation indexing to build an over-the-horizon model of what we now think of as Rosters.

Say you have a Roster of a dozen sites and it isn't turning up what you are looking for. So you say, federation, what's out beyond that? Maybe we then go to the federation map and compose another 100 sites. This is still a small portion of the federation, but it is 10x as likely to have what you are looking for. Not good enough? Try 100x.

So I am thinking the next-generation scrape will just build a who-cites-whom directed graph of the federation as it exists right now. This is actually simpler than what we've done to date and will probably work much better too.
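To make the graph and the expansion concrete, here is a hedged sketch. It assumes the scrape has already been reduced to a Map from each site to the Set of sites it cites; the site names are placeholders:

```
// Who-cites-whom as a Map from site to the Set of sites it cites.
const graph = new Map([
  ['ward.asia.wiki.org', new Set(['found.ward.bay.wiki.org'])],
  ['found.ward.bay.wiki.org', new Set(['ward.asia.wiki.org', 'fed.wiki.org'])]
])

// One hop over the horizon: every site cited by a roster site.
function horizon (roster) {
  const beyond = new Set()
  for (const site of roster) {
    for (const cited of graph.get(site) ?? []) {
      if (!roster.has(cited)) beyond.add(cited)
    }
  }
  return beyond
}

let roster = new Set(['ward.asia.wiki.org'])
roster = new Set([...roster, ...horizon(roster)]) // not enough? expand again for 10x, 100x
```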

See Graphing the Horizon

Graphing the Horizon

See Stepping the Async Scrape

Stepping the Async Scrape

Scrape

We launch a scrape with a list of root sites as arguments. This builds a flat-file index: for each page within each site, a file listing the sites mentioned on that page.
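As a sketch of that layout, assuming one file per page holding the mentioned sites one per line (the data/ directory and the line-oriented format are my assumptions):

```
// Persist one flat file per scraped page, listing the sites it mentions.
async function saveIndex (site, slug, mentions) {
  await Deno.mkdir(`data/${site}`, { recursive: true })
  await Deno.writeTextFile(`data/${site}/${slug}`, [...mentions].join('\n'))
}

await saveIndex('ward.asia.wiki.org', 'search-over-the-horizon',
  new Set(['found.ward.bay.wiki.org', 'fed.wiki.org']))
```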

This emits one network request every second until the scrape is complete. File modification dates are adjusted to match sitemap dates and are used to detect when a change requires a new scrape of a page.
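A sketch of that freshness check, assuming Deno and a sitemap date in epoch milliseconds; Deno.stat and Deno.utime are the real calls, the rest is illustrative:

```
// Rescrape a page only when the sitemap says it changed.
async function needsScrape (path, sitemapDate) {
  try {
    const { mtime } = await Deno.stat(path)
    return !mtime || mtime.getTime() < sitemapDate
  } catch {
    return true // no index file yet, so scrape it
  }
}

// After a successful scrape, stamp the file with the sitemap's date.
async function stamp (path, sitemapDate) {
  const when = new Date(sitemapDate)
  await Deno.utime(path, when, when)
}
```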

Consider

We should wait until the last write completes before exiting, but otherwise allow network operations to overlap when they are slow.
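One hedged way to get that behavior: collect each write promise as it is issued, then settle them all before the process ends. Names here are placeholders:

```
// Let slow work overlap, but don't exit before the last write lands.
const pendingWrites = []

function write (path, text) {
  const promise = Deno.writeTextFile(path, text)
  pendingWrites.push(promise)
  return promise
}

async function finish () {
  await Promise.allSettled(pendingWrites) // every write finished, success or not
}
```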

We count unique errors reported while scraping.

We count unique slugs over all sites.

We count unique sites named by fork or reference.
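All three counts fall out of Sets, which ignore duplicates. A sketch, with the add calls that would sit inside the scrape loop shown as comments:

```
const errors = new Set() // unique error messages
const slugs = new Set()  // unique slugs over all sites
const sites = new Set()  // unique sites named by fork or reference

// inside the scrape loop, something like:
//   errors.add(`${site}: ${err.message}`)
//   slugs.add(slug)
//   sites.add(citedSite)

console.log(`${errors.size} errors, ${slugs.size} slugs, ${sites.size} sites`)
```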

See Over the Horizon Later for overnight stats.

Over the Horizon Later

Handy command for finding scrape data files modified within the last week.
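Something like this find invocation would do, assuming the scrape writes its flat files under data/:

```
find data -type f -mtime -7
```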