Neo4j is an open-source graph database implemented in Java and accessible from software written in other languages using the Cypher query language through a transactional HTTP endpoint. wikipedia site
See Neo4J Resources
See Neo4J Production
See Neo4J Optimization
I've sought the kind assistance of a work colleague, Erika Arnold, author of wikiGraph, a shortest-path visualizing application for Wikipedia based on Neo4j. page
Here I follow her approach.
Build node and relation csv files from Search Index Downloads with a Ruby converter that assigns numeric ids to sites, pages and titles. github
Runtime 3.5 min, 8 MB output. We now repeat this build after every scrape. Look for new data at 1:00 & 7:00, am & pm, Pacific time. nodes.csv rels.csv
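The id assignment in the converter can be pictured with a small sketch. This is not the Ruby converter itself: the input shape (a list of site, slug, title tuples), the function names and the csv column names are all assumptions made for illustration. The HAS and IS relation names are borrowed from later on this page.

```python
import csv
import io

def build_graph_rows(pages):
    """Assign sequential numeric ids to sites, pages and titles.

    `pages` is a hypothetical list of (site, slug, title) tuples; the real
    converter reads the Search Index Downloads instead.
    """
    ids = {}    # (kind, key) -> numeric id
    nodes = []  # rows of (id, label, name)
    rels = []   # rows of (start, end, type)

    def node_id(kind, key, name):
        # Reuse the id when the same site, page or title shows up again.
        if (kind, key) not in ids:
            ids[(kind, key)] = len(ids) + 1
            nodes.append((ids[(kind, key)], kind, name))
        return ids[(kind, key)]

    for site, slug, title in pages:
        s = node_id("Site", site, site)
        p = node_id("Page", (site, slug), slug)
        t = node_id("Title", title, title)
        rels.append((s, p, "HAS"))  # a site has a page
        rels.append((p, t, "IS"))   # a page is an instance of a title
    return nodes, rels

def to_csv(header, rows):
    """Render rows as csv text; the column names are assumptions."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return out.getvalue()
```

Two sites that hold a page with the same title end up sharing one Title node, which is what makes twins easy to find later.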
Build a graph database from csv files with neo4j's import command. docs
IMPORT DONE in 10s 298ms. 92237 nodes 321452 relationships 92237 properties
Find where neo4j resources have been installed.
Move the constructed db to the server's realm.
Edit the config and restart the server.
Open an ssh tunnel to the remote server.
Then view the graph using the built-in app. localhost
It's hard to know what to look for until you have a real need and some experience formulating queries. I read docs and tried things. Some impressed me enough to save the svg.
For fast queries find a good place to start and then traverse from there. I picked .org sites and looked for links to titles about Education. svg
I try retrieving nodes by the shape of their relations alone. This is slow. I find sites that have/link the same title. svg
Top page counts for happening sites.
pages,site
3089,don.ny2.fedwikihappening.net
294,machines.alyson.sf.fedwikihappening.net
225,frances.uk.fedwikihappening.net
220,kate.au.fedwikihappening.net
216,maha.uk.fedwikihappening.net
178,tim.au.fedwikihappening.net
164,chamboonline.sf2.fedwikihappening.net
147,jon.sf.fedwikihappening.net
140,jenny.uk.fedwikihappening.net
134,thoka.uk2.fedwikihappening.net
134,sarah.uk.fedwikihappening.net
133,alyson.sf.fedwikihappening.net
119,audrey.sf.fedwikihappening.net
106,cogdog.sf.fedwikihappening.net
Shortest path between titles with sites that hold the pages along the way. svg
But wait, this path goes through unrelated 'scratch' pages. It also disregards the relations' direction. We need to constrain the path to sites we know.
Revise the batch import to include sites found on each page as KNOWS between a Page and neighborhood Sites. github
Add a directional HAS|KNOWS pattern to the shortest path to constrain the result to operationally discoverable sites. I add Titles to the path ends with IS relations. This adds a bit of ambiguity as to which page we're starting at. svg
I've tested the path by clicking through it. It works.
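The effect of the constraint can be sketched outside the database. What follows is a plain breadth-first search in Python, not the Cypher query used here, and the node names and the LINKS relation type in the usage note are made up for illustration. It follows only the allowed relation types, and only in the direction they were stored.

```python
from collections import deque

def shortest_directed_path(rels, start, goal, allowed=("HAS", "KNOWS")):
    """Breadth-first search over (source, type, target) triples.

    Only relations whose type is in `allowed` are followed, and only
    from source to target, never backwards.
    """
    out = {}
    for src, rel_type, dst in rels:
        if rel_type in allowed:
            out.setdefault(src, []).append(dst)

    previous = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk the back-pointers to rebuild the path.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for nxt in out.get(node, []):
            if nxt not in previous:
                previous[nxt] = node
                queue.append(nxt)
    return None  # no path under these constraints
```

Given hypothetical triples such as ("page:a/start", "KNOWS", "site:b") and ("site:b", "HAS", "page:b/middle"), the search hops page to site to page. A shortcut through a scratch page joined by some other relation type is ignored unless that type is added to `allowed`.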
We can find the sites with the most neighbors by counting distinct KNOWS relations.
The numbers are much higher than we might expect. This is because we conflate forks, references and rosters while we scrape. For the graph database we should do better.
526 don.ny2.fedwikihappening.net
407 ward.asia.wiki.org
363 search.fed.wiki.org:3030
170 journal.hapgood.net
160 david.viral.academy
157 c0de.academy
131 wiki.viral.academy
128 david.bovill.me
125 forage.ward.fed.wiki.org
115 ward.fed.wiki.org
110 machines.hapgood.net
108 fedwiki.jeffist.com
96 tim.federatedwiki.org
94 chamboonline.sf2.fedwikihappening.net
92 sfw.mcmorgan.org
92 edfedwiki.com
90 sarah.uk.fedwikihappening.net
88 jenny.uk.fedwikihappening.net
87 maha.uk.fedwikihappening.net
86 tim.au.fedwikihappening.net
This agrees with the command line word count of the site wide rollup of site.txt files.
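The neighbor count can be sketched in a few lines, assuming the relations are available as plain pairs. The function name and the input shapes are hypothetical. The sketch takes a site's neighbors to be the distinct sites known by any page that site has, which is one reading of the count described above.

```python
from collections import defaultdict

def neighbor_counts(has, knows):
    """Count distinct known sites per owning site.

    `has` holds (site, page) pairs and `knows` holds (page, site) pairs;
    both shapes are assumptions made for this sketch.
    """
    owner = {page: site for site, page in has}
    neighbors = defaultdict(set)
    for page, known_site in knows:
        site = owner.get(page)
        if site is not None:
            # A set keeps each known site once, however many pages mention it.
            neighbors[site].add(known_site)
    counts = [(len(found), site) for site, found in neighbors.items()]
    # Largest neighborhoods first, ties broken by site name.
    return sorted(counts, key=lambda pair: (-pair[0], pair[1]))
```

Because the scrape conflates forks, references and rosters, any count built this way inherits the same inflation.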
We can write a meaningful who-links-here that will resolve to the page at least as a twin. svg