💾 Archived View for jb55.com › ward.asia.wiki.org › neo4j captured on 2022-01-08 at 14:17:36. Gemini links have been rewritten to link to archived content

-=-=-=-=-=-=-

Neo4J

Neo4j is an open-source graph database implemented in Java and accessible from software written in other languages using the Cypher query language through a transactional HTTP endpoint. wikipedia site

wikipedia

site

See Neo4J Resources

Neo4J Resources

See Neo4J Production

Neo4J Production

See Neo4J Optimization

Neo4J Optimization

I've sought the kind assistance of a work colleague, Erika Arnold, author of wikiGraph, a shortest-path visualizing application for Wikipedia based on Neo4J. page

page

Here I follow her approach.

Load

Build node and relation csv files from Search Index Downloads with a ruby converter that assigns numeric ids to sites, pages and titles. github

Search Index Downloads

github
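The id assignment described above can be sketched as follows. This is a minimal illustration in Python, not the ruby converter itself: the input shape (site, slug, title tuples), the function names, the column headers and the choice of HAS and IS as the only relation types are all assumptions made for the example. The import command expects its own header conventions, which are described in the docs.

```python
import csv
import io


def build_csv(pages):
    """Assign numeric ids to sites, pages and titles.

    pages: iterable of (site, slug, title) tuples (a hypothetical input shape).
    Returns (nodes, rels) as lists of row tuples.
    """
    ids = {}     # (kind, key) -> numeric id
    nodes = []   # rows of (id, kind, name)
    rels = []    # rows of (start_id, end_id, type)

    def node_id(kind, key, name):
        # Hand out the next integer the first time a node is seen.
        if (kind, key) not in ids:
            ids[(kind, key)] = len(ids) + 1
            nodes.append((ids[(kind, key)], kind, name))
        return ids[(kind, key)]

    for site, slug, title in pages:
        s = node_id("Site", site, site)
        p = node_id("Page", (site, slug), slug)  # a page is unique per site
        t = node_id("Title", title, title)       # a title is shared across sites
        rels.append((s, p, "HAS"))  # assumed relation: site holds the page
        rels.append((p, t, "IS"))   # assumed relation: page carries the title
    return nodes, rels


def to_csv(header, rows):
    """Render rows as csv text with a single header line."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return out.getvalue()
```

Because titles are keyed by their text alone, two sites holding a page with the same title end up pointing at one shared Title node, which is what makes title-based queries possible later.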

Runtime 3.5 min, 8 MB output. We now repeat this build after every scrape. Look for new data at 1:00 and 7:00, a.m. and p.m., Pacific time. nodes.csv rels.csv

nodes.csv

rels.csv

Build a graph database from csv files with neo4j's import command. docs

docs

IMPORT DONE in 10s 298ms. 92237 nodes 321452 relationships 92237 properties

Find where neo4j resources have been installed.

Move the constructed db to the server's realm.

Edit the config and restart the server.

Open an ssh tunnel to the remote server.

Then view the graph using the builtin app. localhost

localhost

Query

It's hard to know what to look for until you have a real need and some experience formulating queries. I read docs and tried things. Some impressed me enough to save the svg.

For fast queries find a good place to start and then traverse from there. I picked .org sites and looked for links to titles about Education. svg

svg

I try retrieving nodes by the shape of their relations alone. This is slow. I find sites that have or link to the same title. svg

svg
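For comparison, the "same title" question can be answered outside the database by grouping. The sketch below is Python rather than a Cypher query, and it assumes the graph has already been flattened into (site, title) pairs; the function name and input shape are hypothetical.

```python
from collections import defaultdict


def shared_titles(site_titles):
    """Return titles held by more than one site.

    site_titles: iterable of (site, title) pairs, an assumed flattening of
    Site -> Page -> Title paths.
    """
    sites_by_title = defaultdict(set)
    for site, title in site_titles:
        sites_by_title[title].add(site)
    # Keep only titles that at least two distinct sites share.
    return {
        title: sorted(sites)
        for title, sites in sites_by_title.items()
        if len(sites) > 1
    }
```

This makes one pass over every pair, which hints at why a pattern with no fixed starting node is slow: there is nothing to narrow the search before the whole set has been visited.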

Top page counts for happening sites.

pages,site
3089,don.ny2.fedwikihappening.net
294,machines.alyson.sf.fedwikihappening.net
225,frances.uk.fedwikihappening.net
220,kate.au.fedwikihappening.net
216,maha.uk.fedwikihappening.net
178,tim.au.fedwikihappening.net
164,chamboonline.sf2.fedwikihappening.net
147,jon.sf.fedwikihappening.net
140,jenny.uk.fedwikihappening.net
134,thoka.uk2.fedwikihappening.net
134,sarah.uk.fedwikihappening.net
133,alyson.sf.fedwikihappening.net
119,audrey.sf.fedwikihappening.net
106,cogdog.sf.fedwikihappening.net

Shortest path between titles with sites that hold the pages along the way. svg

svg

But wait, this path goes through unrelated 'scratch' pages. It also disregards the relations' direction. We need to constrain the path to sites we know.

Knows

Revise the batch import to include sites found on each page as KNOWS relations between a Page and its neighborhood Sites. github

github

Add a directional HAS|KNOWS pattern to the shortest path to constrain the result to operationally discoverable sites. I add Titles to the path ends with IS relations. This adds a bit of ambiguity as to which page we're starting at. svg

svg
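The effect of constraining a path by relation type and direction can be sketched with a plain breadth-first search. This is an illustration in Python, not the Cypher query used here; the triple format and function name are assumptions, and only the HAS and KNOWS relation names come from the text above.

```python
from collections import deque


def shortest_path(rels, start, goal, allowed=("HAS", "KNOWS")):
    """Breadth-first shortest path over directed, typed relations.

    rels: iterable of (source, type, target) triples (an assumed format).
    Only relation types in `allowed` are followed, and only from source
    to target, never backwards.
    """
    outgoing = {}
    for src, kind, dst in rels:
        if kind in allowed:
            outgoing.setdefault(src, []).append(dst)

    previous = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk the predecessor links back to the start.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for nxt in outgoing.get(node, []):
            if nxt not in previous:
                previous[nxt] = node
                queue.append(nxt)
    return None  # goal is not reachable under the constraints
```

With these constraints a shortcut through a relation of another type is ignored, and a path that would require walking a relation backwards is not found at all.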

I've tested the path by clicking through it. It works.

We can find the sites with the most neighbors by counting distinct KNOWS relations.

The numbers are much higher than we might expect. This is because we conflate forks, references and rosters while we scrape. For the graph database we should do better.

526 don.ny2.fedwikihappening.net
407 ward.asia.wiki.org
363 search.fed.wiki.org:3030
170 journal.hapgood.net
160 david.viral.academy
157 c0de.academy
131 wiki.viral.academy
128 david.bovill.me
125 forage.ward.fed.wiki.org
115 ward.fed.wiki.org
110 machines.hapgood.net
108 fedwiki.jeffist.com
96 tim.federatedwiki.org
94 chamboonline.sf2.fedwikihappening.net
92 sfw.mcmorgan.org
92 edfedwiki.com
90 sarah.uk.fedwikihappening.net
88 jenny.uk.fedwikihappening.net
87 maha.uk.fedwikihappening.net
86 tim.au.fedwikihappening.net

This agrees with the command-line word count of the site-wide rollup of site.txt files.
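The neighbor count above amounts to collecting, for each site, the distinct sites its pages know. A minimal Python sketch follows, assuming the HAS and KNOWS relations are available as plain pairs; the function name and input shapes are hypothetical, and whether a site should count itself as a neighbor is left open.

```python
from collections import defaultdict


def neighbor_counts(has, knows):
    """Count distinct known sites per owning site.

    has:   iterable of (site, page) pairs, one per HAS relation.
    knows: iterable of (page, site) pairs, one per KNOWS relation.
    Returns (count, site) tuples, largest count first.
    """
    site_of = {page: site for site, page in has}
    neighbors = defaultdict(set)
    for page, known_site in knows:
        owner = site_of.get(page)
        if owner is not None:
            # A set keeps each neighbor once, however many pages mention it.
            neighbors[owner].add(known_site)
    return sorted(
        ((len(sites), site) for site, sites in neighbors.items()),
        reverse=True,
    )
```

Forks, references and rosters all arrive here as the same kind of pair, so the sketch inflates the counts in the same way the scrape does.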

We can write a meaningful who-links-here that will resolve to the page at least as a twin. svg

svg