The Ruby Sitemap Scrape provides the first full text search of the visible federation. We've learned a lot by building this ourselves from grep-like utilities. Here we list todos that have surfaced and been completed.
Find all pages that share any items with the current page. Prototype search link now available. github
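A sketch of how such a search could work against the scraped data, assuming one items.txt of item ids per site/slug directory (a layout this page doesn't confirm):

```ruby
# Hypothetical sketch: two pages share items when their
# items.txt files have any item id in common.
def sharing_pages(current_items_path)
  ids = File.readlines(current_items_path, chomp: true)
  Dir.glob('*/*/items.txt').select do |path|
    next false if path == current_items_path
    (File.readlines(path, chomp: true) & ids).any?
  end
end
```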
Scrape item ids and save them in items.txt files. Devise some convenient way to initiate a search from any paragraph. See Link Symmetry
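A minimal sketch of the items.txt scrape, assuming pages are stored as federated wiki page json with a story array of items (the paths here are illustrative):

```ruby
require 'json'

# Collect the story item ids from one scraped page and save
# them, one per line, in that page's items.txt.
def write_items(page_json_path, out_dir)
  page = JSON.parse(File.read(page_json_path))
  ids = (page['story'] || []).map { |item| item['id'] }.compact
  File.write(File.join(out_dir, 'items.txt'), ids.join("\n") + "\n")
end
```

The plugins.txt scrape mentioned further down is the same shape, collecting each item's type instead of its id.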
Refactor search.fed.wiki.org to have separate what, help and query details pages. Keep them up to date.
Offer a match option on the search form: radio buttons for "and" (the default) and "or".
Add Newly Found Sites to the activity report even if they are not recently active. This would report the consequence of some other activity that linked to the site.
Add a permalink to the search results so that searches can be saved and rerun with a single click. search
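One way to build such a permalink is to serialize the search into the URL's query string. The host, port and parameter names below are assumptions for illustration, not necessarily what the search app uses:

```ruby
require 'uri'

# Hypothetical: encode the search term and match mode so the
# resulting URL reruns the search when followed.
def permalink(query, match: 'and')
  'http://search.fed.wiki.org:3030/?' +
    URI.encode_www_form(q: query, match: match)
end
```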
Find and remove old rosters after a week or so. Possibly merge them into whole days before that.
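A hedged sketch of that cleanup, assuming rosters are files named with an ISO 8601 timestamp under a rosters directory:

```ruby
require 'date'
require 'fileutils'

cutoff = Date.today - 7

# Merge each old day's rosters into one file per day,
# then remove the originals.
Dir.glob('rosters/*T*.txt').group_by { |f| File.basename(f)[0, 10] }.each do |day, files|
  next if Date.parse(day) > cutoff
  File.write("rosters/#{day}.txt", files.sort.map { |f| File.read(f) }.join)
  FileUtils.rm(files)
end
```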
Momentarily defeat the scrape's incremental mechanism in order to retrieve the new indices, items.txt and plugins.txt from all pages. See Full Scrape
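The mechanism isn't detailed here, but if the scraper skips pages whose cached date matches the sitemap date, one simple way to defeat it for a single run is a force switch (a hypothetical sketch):

```ruby
FORCE = ENV['FORCE_FULL_SCRAPE'] == '1'

# Treat every page as stale when forcing, so the new indices
# are regenerated for all pages once.
def stale?(cached_date, sitemap_date)
  FORCE || cached_date.nil? || sitemap_date > cached_date
end
```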
Scrape item types and save them in plugins.txt files.
Add html plugin to sfw.c2.com since we're now generating lots of html items.
Improve the grep sequence so it doesn't blow up with "too many arguments" from the shell. github
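The usual fix is to stop interpolating thousands of filenames into a single command line and instead stream them to xargs, which batches the grep invocations. A sketch, not necessarily the commit's exact approach:

```ruby
require 'open3'

# Feed NUL-delimited filenames to xargs so no single grep
# command line exceeds the shell's argument limit.
def grep_files(pattern, files)
  out, _status = Open3.capture2('xargs', '-0', 'grep', '-l', '--', pattern,
                                stdin_data: files.join("\0"))
  out.split("\n")
end
```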
We've somehow lost utf-8 decoding in the scraper. The error messages are new, and 140 sites have been lost from view. This was the first successful run from cron. Solution online. post
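The post has the actual solution; a common Ruby pattern for this class of failure is to read raw bytes, force UTF-8, and scrub invalid sequences rather than letting them raise:

```ruby
# Hedged sketch: tolerate mis-encoded pages instead of
# dropping their sites from view.
def read_utf8(path)
  text = File.read(path, mode: 'rb')
  text.force_encoding(Encoding::UTF_8)
  text.valid_encoding? ? text : text.scrub('?')
end
```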
We've figured out how to set CORS headers on the port 3030 sinatra server that delivers the recent-activity.json after giving up on the default 'public' behavior. github
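In Sinatra this comes down to a before filter; a minimal sketch of the shape (route and filename are illustrative):

```ruby
require 'sinatra'

set :port, 3030

# Send CORS headers on every response so browsers on other
# origins may fetch the json.
before do
  headers 'Access-Control-Allow-Origin' => '*'
end

get '/recent-activity.json' do
  content_type :json
  File.read('recent-activity.json')
end
```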
Grep the words.txt with a simple web app. site
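A minimal sketch of such an app in Sinatra, shelling out to grep with an argument vector so the query is never interpreted by the shell (endpoint and parameter names are assumptions):

```ruby
require 'sinatra'
require 'open3'

get '/search' do
  q = params['q'].to_s
  halt 400, 'missing q' if q.empty?
  out, _status = Open3.capture2('grep', '-i', '--', q, 'words.txt')
  content_type 'text/plain'
  out
end
```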
Recent activity now includes new sites in a more compact format. github
I've added a report listing all sites with ten or more pages. I attempt to group these logically based on their subdomain hierarchy.
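One plausible grouping reverses the domain labels so subdomains of the same parent sort together. A sketch, where sites.txt and the two-label parent are assumptions:

```ruby
sites = File.readlines('sites.txt', chomp: true)

# Group hosts under their last two domain labels, then sort
# each group by reversed labels so the hierarchy reads top-down.
grouped = sites.group_by { |host| host.split('.').last(2).join('.') }
grouped.sort.each do |parent, hosts|
  puts parent
  hosts.sort_by { |h| h.split('.').reverse }.each { |h| puts "  #{h}" }
end
```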
I've revised my 2011 cron job that feeds home sensor network data into the federation on a five minute cycle. This polluted the scrape's activity report until I modified the perl script to date pages with the install date of each sensor, not the date of the reading. site
I've revised my 2012 cron job that reports farm activity to date the activity in the journal with the date that it happened. A second commit suspends reporting until there is activity after the last report. github github