The Ruby Sitemap Scrape provides the first full-text search of the visible federation. We've learned a lot building this ourselves from grep-like utilities. Here we list todos that have surfaced and been completed.
Find all pages that share any items with the current page. Prototype search link now available. github
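A minimal sketch of how that lookup could work, assuming the scrape leaves one item id per line in per-page items.txt files; the paths here are illustrative, not the actual layout.

```ruby
require 'set'

# Item ids scraped from the current page (hypothetical path).
current = 'sites/example.com/pages/welcome-visitors/items.txt'
ids = Set.new(File.readlines(current, chomp: true))

# Report every other page whose items.txt shares at least one id.
Dir.glob('sites/*/pages/*/items.txt') do |path|
  next if path == current
  shared = File.readlines(path, chomp: true).count { |id| ids.include?(id) }
  puts "#{shared} shared: #{path}" if shared > 0
end
```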
Scrape item ids and save them in items.txt files. Devise some convenient way to initiate a search from any paragraph. See Link Symmetry
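A sketch of that scraping step, assuming each page's JSON sits on disk next to where items.txt should go and that story items carry an id field; the directory layout is an assumption.

```ruby
require 'json'

# For every scraped page, pull the story item ids into items.txt.
Dir.glob('sites/*/pages/*') do |dir|
  page = JSON.parse(File.read(File.join(dir, 'page.json')))
  ids = (page['story'] || []).map { |item| item['id'] }.compact
  File.write(File.join(dir, 'items.txt'), ids.join("\n") + "\n")
end
```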
Refactor search.fed.wiki.org to have separate what, help and query details pages. Keep them up to date.
Let a search with several terms match them all or any of them: radio buttons choose between 'and' (checked by default) and 'or'.
Add Newly Found Sites to the activity report even if they are not recently active. This would be reporting the consequence of some other activity that linked to the site.
Add a permalink to the search results so that searches can be saved and rerun with a single click. search
Find and remove old rosters after a week or so. Possibly merge them into whole-day rosters before that.
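A sketch of the cleanup, assuming rosters accumulate as one file per run in a rosters/ directory; the name and layout are assumptions.

```ruby
# Delete roster files whose modification time is more than a week old.
cutoff = Time.now - (7 * 24 * 60 * 60)
Dir.glob('rosters/*.txt') do |path|
  File.delete(path) if File.mtime(path) < cutoff
end
```

Merging into whole days could concatenate the surviving files that share a date prefix before this pass runs.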
Momentarily defeat the scrape's incremental mechanism in order to retrieve the new indices, items.txt and plugins.txt from all pages. See Full Scrape
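The actual mechanism lives in the scraper, but assuming the incremental check compares each page's sitemap date to the date recorded on the last run, a hypothetical --full flag could defeat it for one run.

```ruby
# Hypothetical: treat every page as stale when --full is given, so
# the new items.txt and plugins.txt get fetched for all pages.
FULL_SCRAPE = ARGV.include?('--full')

def needs_fetch?(sitemap_date, last_seen_date)
  return true if FULL_SCRAPE || last_seen_date.nil?
  sitemap_date > last_seen_date
end
```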
Scrape item types and save them in plugins.txt files.
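The same walk that writes items.txt can record the types, sketched here under the same assumed layout.

```ruby
require 'json'

# Record the distinct plugin types used on each page.
Dir.glob('sites/*/pages/*') do |dir|
  page = JSON.parse(File.read(File.join(dir, 'page.json')))
  types = (page['story'] || []).map { |item| item['type'] }.compact.uniq
  File.write(File.join(dir, 'plugins.txt'), types.join("\n") + "\n")
end
```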
Add html plugin to sfw.c2.com since we're now generating lots of html items.
Improve the grep sequence so it doesn't blow up with "too many arguments" from the shell. github
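The commit has the real fix; in general the blowup happens when a glob expands past the kernel's argument-size limit, and feeding files to grep in fixed-size batches stays under it. A sketch:

```ruby
# Thousands of words.txt paths in one grep invocation can exceed the
# kernel's argument limit; slicing into batches keeps each call small.
files = Dir.glob('sites/*/pages/*/words.txt')
files.each_slice(500) do |batch|
  # -l lists matching files; the search term here is illustrative.
  system('grep', '-l', 'federation', *batch)
end
```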
We've somehow lost utf-8 decoding in the scraper. The error messages are new, and 140 sites dropped from view. This was the first successful run from cron. Solution online. post
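The post has the details; the usual shape of this fix in Ruby is to pin the encoding rather than trust the environment, since cron runs with a minimal locale and the default external encoding can silently fall back to US-ASCII there.

```ruby
# Read page text as UTF-8 regardless of the locale cron provides.
text = File.read('words.txt', encoding: 'UTF-8')

# Bytes read raw can be reinterpreted the same way.
raw = File.binread('page.json')
json = raw.force_encoding('UTF-8')
```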
We've figured out how to set CORS headers on the port 3030 sinatra server that delivers the recent-activity.json after giving up on the default 'public' behavior. github
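A minimal sketch of such a server; the route and filename are taken from the item above, everything else is an assumption.

```ruby
require 'sinatra'

set :port, 3030

# Browsers block cross-origin fetches unless the response grants
# access explicitly, hence the CORS header on the report.
get '/recent-activity.json' do
  headers 'Access-Control-Allow-Origin' => '*'
  content_type :json
  File.read('recent-activity.json')
end
```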
Grep the words.txt with a simple web app. site
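A sketch of such an app, again as Sinatra; because the query rides in a GET parameter, the result URL doubles as the permalink wished for above. Paths and markup are assumptions.

```ruby
require 'sinatra'
require 'cgi'

# Search every scraped words.txt for the requested term.
get '/' do
  term = params['q'].to_s
  return '<form><input name=q><button>search</button></form>' if term.empty?
  hits = Dir.glob('sites/*/pages/*/words.txt').select do |path|
    File.read(path, encoding: 'UTF-8').include?(term)
  end
  "<p>#{hits.size} pages match #{CGI.escapeHTML(term)}</p>" +
    hits.map { |path| "<p>#{CGI.escapeHTML(path)}</p>" }.join
end
```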
Recent activity now includes new sites in a more compact format. github
I've added a report listing all sites with ten or more pages. I attempt to group these logically based on their subdomain hierarchy.
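One way to get that grouping: sort on the reversed domain labels so siblings in a subdomain hierarchy land next to each other. A sketch with illustrative counts; the real report draws them from the scrape.

```ruby
# Illustrative page counts per site, keyed by hostname.
counts = { 'fed.wiki.org' => 40, 'search.fed.wiki.org' => 12,
           'c2.com' => 3, 'sfw.c2.com' => 25 }

# Keep sites with ten or more pages, then sort by reversed labels so
# fed.wiki.org and search.fed.wiki.org group together.
report = counts.select { |_, pages| pages >= 10 }
               .sort_by { |host, _| host.split('.').reverse }

report.each { |host, pages| puts format('%5d  %s', pages, host) }
```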
I've revised my 2011 cron job that feeds home sensor network data into the federation on a five minute cycle. This polluted the scrape's activity report until I modified the perl script to date pages with the install date of each sensor, not the date of the reading. site
I've revised my 2012 cron job that reports farm activity so that journal entries are dated when the activity actually happened. A second commit suspends reporting until there is new activity after the last report. github github