💾 Archived View for jb55.com › ward.asia.wiki.org › search-index-downloads captured on 2022-01-08 at 14:02:59. Gemini links have been rewritten to link to archived content

-=-=-=-=-=-=-

Search Index Downloads

The search index is available as downloadable files in several formats, including JSON objects designed for network graphing.

Our sitemap scraper runs four times a day. Logs from each run can be viewed online.

page

We report the domain name and page count of each site found to be online and listing pages in its sitemap.

page

We distribute the index files individually and as a single 48-megabyte compressed tar file.

tgz

Index

The index is organized as a collection of text files containing unique words extracted from various fields of each page. These are grouped into directories by site and then page within site.
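
A minimal sketch of how such a layout could be searched, assuming a directory tree of the form site/page/file with one lowercase word per line. The file name words.txt and the function name are hypothetical and are not taken from the index itself.

```python
from pathlib import Path


def pages_containing(index_root, word, filename="words.txt"):
    """Yield (site, page) pairs whose word file lists `word`.

    Assumed layout (hypothetical): <index_root>/<site>/<page>/<filename>,
    holding unique lowercase words separated by whitespace.
    """
    word = word.lower()
    for path in sorted(Path(index_root).glob(f"*/*/{filename}")):
        words = set(path.read_text(encoding="utf-8").split())
        if word in words:
            # parent is the page directory, its parent is the site directory
            yield path.parent.parent.name, path.parent.name
```

Calling list(pages_containing("index", "federation")) would then return the site and page directory names that list the word, under the layout assumed above.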

We also include federation-wide rollups of these files, which we maintain even though search does not use them.

We now include a federation-wide rollup containing the slugs of pages found in the sites searched. This might be useful for title completion. It is an experiment.

txt
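
One way title completion over the slug rollup could work is a prefix lookup in a sorted list. This is only a sketch: it assumes the rollup holds one slug per line, and the function names are hypothetical.

```python
from bisect import bisect_left


def load_slugs(lines):
    """Return a sorted list of unique slugs from an iterable of text lines."""
    return sorted({line.strip() for line in lines if line.strip()})


def complete(slugs, prefix, limit=10):
    """Return up to `limit` slugs from the sorted list that start with `prefix`."""
    start = bisect_left(slugs, prefix)  # first slug that is >= prefix
    matches = []
    for slug in slugs[start:]:
        if not slug.startswith(prefix) or len(matches) >= limit:
            break
        matches.append(slug)
    return matches
```

Because the list is sorted, all slugs sharing a prefix sit next to each other, so the lookup needs one binary search and a short scan.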

Counts

We accumulate various counts in another text file, with one line of JSON for each scrape.

txt

Here we show a sample line after being formatted as indented text. The scan counts are read from logs while the index counts are line counts of the site rollup text files.
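
A minimal sketch of that formatting step, assuming only that each line of the counts file is a complete JSON value. The function names are hypothetical, and no field names are assumed.

```python
import json


def format_counts(line):
    """Parse one line of JSON and return it as indented text."""
    return json.dumps(json.loads(line), indent=2)


def last_record(lines):
    """Return the parsed record from the last non-blank line, or None."""
    record = None
    for line in lines:
        if line.strip():
            record = json.loads(line)
    return record
```

Passing an open file object to last_record would yield the counts from the most recent scrape, since each scrape adds one line.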

See Sitemap Scrape Statistics for plots of these counts.

Sitemap Scrape Statistics

Graphs

We aggregate information from the index into single files that represent graphs as nodes and arcs, in two forms.

Nodes are site names and arcs point to remote sites.

json

Nodes are page slugs and arcs are internal links.

json
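
A sketch of how either graph file could be turned into lookup tables for following arcs in both directions. The JSON shape used here, a "links" list of objects with "source" and "target" keys, is an assumption and may not match the actual files; both function names are hypothetical.

```python
from collections import defaultdict


def arcs_from_graph(graph):
    """Extract (source, target) pairs from a parsed graph.

    Hypothetical shape: {"nodes": [...], "links": [{"source": ..., "target": ...}]}.
    Adjust this function to the keys the real files use.
    """
    return [(link["source"], link["target"]) for link in graph["links"]]


def build_adjacency(arcs):
    """Return (forward, backward) maps from node to the set of its neighbors."""
    forward = defaultdict(set)
    backward = defaultdict(set)
    for source, target in arcs:
        forward[source].add(target)   # arcs leaving source
        backward[target].add(source)  # arcs arriving at target
    return forward, backward
```

Keeping both maps makes "what does this node link to" and "what links to this node" equally cheap, at the cost of storing each arc twice.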

We offer JavaScript versions of the aggregated graph data files that can be included in a web page with a script tag: site-web.js and slug-web.js.

site-web.js

slug-web.js

Applications

Title Network Browser lets you navigate from page title to page title, following links forward and backward.

Title Network Browser

Site Network Diagram shows all visible sites connected by arcs where there are neighborhood citations.

Site Network Diagram

Recent Activity Report shows sites found to have new activity in the last week.

Recent Activity Report

Neo4J with batch loading and an experimental interactive query plugin.

Neo4J

Item Distribution is computed from the site and page items.txt files.

Item Distribution
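
A rough sketch of how such a distribution could be tallied from the page-level files. It assumes each page directory holds an items.txt listing the item types used on that page, separated by whitespace. That layout and the function name are assumptions, not a description of the actual computation.

```python
from collections import Counter
from pathlib import Path


def item_distribution(index_root):
    """Count, for each item type, how many pages list it.

    Assumed layout (hypothetical): <index_root>/<site>/<page>/items.txt.
    """
    counts = Counter()
    for path in Path(index_root).glob("*/*/items.txt"):
        # set() so a type repeated within one file is counted once per page
        counts.update(set(path.read_text(encoding="utf-8").split()))
    return counts
```

Calling item_distribution("index").most_common(10) would list the ten most widely used item types under these assumptions.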

Possibly intersect with the list of bad words.

twitter

file