We make the search index available as downloadable files in several formats, including json objects designed for network graphing.
Our sitemap scraper runs four times a day. Logs from each run can be viewed online. page
We report the page count and domain name of sites found to be online and reporting pages in their sitemaps. page
We distribute the index files individually and in a single 48 megabyte compressed tar file. tgz
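A minimal sketch of fetching and unpacking the archive with the python standard library; the url here is hypothetical, the real address is behind the tgz link above.

```python
import tarfile
import urllib.request

# Hypothetical URL; the real address is behind the "tgz" link above.
URL = "http://search.example.org/index.tgz"

# Download the single compressed tar file and unpack it locally.
path, _ = urllib.request.urlretrieve(URL, "index.tgz")
with tarfile.open(path, "r:gz") as tar:
    tar.extractall("search-index")
```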
The index is organized as a collection of text files containing unique words extracted from various fields of each page. These are grouped into directories by site, and then by page within each site.
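A sketch of walking that layout, assuming one directory per site, one subdirectory per page slug, and word files within each; the exact file names are assumptions.

```python
from pathlib import Path

root = Path("search-index")  # wherever the tar file was unpacked

# Directories are named for sites; within each, one directory per page slug.
for site in sorted(p for p in root.iterdir() if p.is_dir()):
    for page in sorted(p for p in site.iterdir() if p.is_dir()):
        # Each page directory holds text files of unique words per field.
        words = set()
        for txt in page.glob("*.txt"):
            words |= set(txt.read_text().split())
        print(site.name, page.name, len(words), "unique words")
```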
We include federation-wide rollups of these files, which we don't use for search but maintain anyway.
We now include a federation-wide rollup containing the slugs of pages found in the sites searched. This might be useful for title completion. It is an experiment. txt
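A sketch of the title-completion idea, assuming the rollup holds one slug per line; the file name is an assumption.

```python
# Assumed file name; the real rollup is behind the "txt" link above.
with open("slugs.txt") as f:
    slugs = sorted({line.strip() for line in f if line.strip()})

def complete(prefix):
    """Return all slugs beginning with prefix, for title completion."""
    return [s for s in slugs if s.startswith(prefix)]

print(complete("search-"))
```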
We accumulate various counts in another text file with one line of json for each scrape. txt
Here we show a sample line after being formatted as indented text. The scan counts are read from logs while the index counts are line counts of the site rollup text files.
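A sketch of reading that file, assuming one json object per line; the file name is an assumption, and the pretty-printing mirrors the indented formatting described above.

```python
import json

# Assumed file name; the real file is behind the "txt" link above.
with open("scrape-counts.txt") as f:
    scrapes = [json.loads(line) for line in f if line.strip()]

# Pretty-print the most recent scrape's counts as indented text.
print(json.dumps(scrapes[-1], indent=2))
```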
See Sitemap Scrape Statistics for plots of these counts.
We aggregate information from the index into single files representing graphs as nodes and arcs in two forms, with a loading sketch below.
Nodes are site names and arcs are remote sites. json
Nodes are page slugs and arcs are internal links. json
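A sketch of loading one of these graph files; the file name and the "nodes"/"arcs" schema with "from" and "to" keys are assumptions.

```python
import json
from collections import defaultdict

# Assumed file name and schema; the real files are behind the "json" links.
with open("site-graph.json") as f:
    graph = json.load(f)

# Build an adjacency list from the arcs: site name -> sites it cites.
adjacency = defaultdict(set)
for arc in graph["arcs"]:
    adjacency[arc["from"]].add(arc["to"])

for node, cited in sorted(adjacency.items()):
    print(node, "cites", len(cited), "sites")
```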
We offer javascript versions of the aggregated graph data files that can be included in a web page with a script tag. site-web.js slug-web.js
Title Network Browser allows one to navigate from page title to page title, following links forward and backward.
Site Network Diagram shows all visible sites connected by arcs where there are neighborhood citations.
Recent Activity Report shows sites found to have new activity in the last week.
Neo4J with batch loading and an experimental interactive query plugin.
Item Distribution computed from site and page items.txt.
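A sketch of that tally, assuming each page directory holds an items.txt with one item type per line; both the layout and file placement are assumptions.

```python
from collections import Counter
from pathlib import Path

# Assumed layout: one item type per line in each page's items.txt.
counts = Counter()
for items in Path("search-index").rglob("items.txt"):
    counts.update(
        line.strip() for line in items.read_text().splitlines() if line.strip()
    )

for item_type, n in counts.most_common(10):
    print(f"{n:8} {item_type}")
```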
Possibly intersect with the list of bad words. twitter file
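A sketch of that intersection, assuming both the slug rollup and the bad-word list hold one entry per line; the file names are assumptions.

```python
# Assumed file names; the real lists are behind the links above.
with open("slugs.txt") as f:
    slugs = [line.strip() for line in f if line.strip()]
with open("bad-words.txt") as f:
    bad = {line.strip().lower() for line in f if line.strip()}

# Flag slugs whose hyphen-separated words intersect the bad-word list.
flagged = [s for s in slugs if set(s.lower().split("-")) & bad]
print(len(flagged), "slugs flagged")
```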