💾 Archived View for jb55.com › ward.asia.wiki.org › search-index captured on 2022-01-08 at 14:35:56. Gemini links have been rewritten to link to archived content
⬅️ Previous capture (2021-12-04)
-=-=-=-=-=-=-
We consider how we might index and query multiple sites within a distributed search database routinely hosted in federated wiki page server/editors.
See Distributed Search which describes the social structure within which search is only one component.
See Distributed Hash Table for related peer-to-peer research spawned by Napster.
We want to find site/slugs for pages containing various keys, including words, ids, links and sites.
We will maintain indices on each kind of key mapping key to multiple site/slug values.
On each action, the old and new page will be analyzed for keys and the indexes updated to reflect changes.
Aside: Some optimization may be possible when only keys are added. If a new sentence is added to a paragraph, but no text deleted, then words and link indices could grow but no index will shrink and the full page content need not be reanalyzed.
We will query for collections of whole keys in an individual index. This could be optimized by sorting the keys and continuing an index traversal.
We can estimate the size of indices from data collected with the Ruby Sitemap Scrape.
See Multiway Trees for Knuth's treatment of the ubiquitous b-tree.
We believe some extension of neighborhood will be semantically useful and have described a version of 'region' where sites mentioned on a server are included out to one (possibly more) levels of indirection.
We must consider whether a region is an artifact of indexing or search.
If we index a region then as sites enter and exit the region (due to editing on the server) we must add or remove large numbers of index entries.
If we search by region then we must relay every query to the included sites with the possible optimization of recognizing sites hosted on the same server and thus in the same index.
Distributing search has the disadvantage that small sites with read-only content might not accept responsibility for search benefiting users of other sites. It would be better if full featured sites could carry the larger burden.
A complete search service will include scraping and query processing along the lines already implemented in the Ruby Sitemap Scrape and related mechanisms.
Comprehensive search solutions provide configurable machinery for all aspects of analysis, indexing, storage and query that one would expect of a centralized search service.
Of these issues we consider indexing to be the dominant technical challenge not addressed by mechanisms already largely in place.
At the core of all search is the log(n) behavior of an inverted index typically implemented with some version of the ubiquitous b-tree. wikipedia
We will succeed making search service widely distributed so long as its cost is modest and installation effortless. Small pure-javascript is preferred.
Strata looks attractive. It is a database primitive interface to a b‑tree. One can use Strata to create different types database strategies. npm
We may not need independent storage for leaf nodes, the value of key-value pairs. The b-tree index storage might find that site-slug pairs are sufficiently indirect to serve as leaf pointers.