Change Log
2021 November Update
- Further refinements to keyword extraction. The technically minded can read a few words about that here:
/log/37-keyword-extraction.gmi
- Improved crawling logic to offer more leniency toward sites that have a high ranking. This improves the chance of pushing through local minima and discovering additional quality content on those sites.
- Mended some fences with a few of the websites that blocked my crawler when it was young and unruly, and removed a few sites from the blocklist that didn't belong there. More quality websites in the index!
- As an experiment, tagged websites that contain links to Amazon, attempt to place cookies on the crawler, contain links to known trackers, contain audio/video tags, or contain JavaScript.
It's not perfect: it will miss some trackers, and it will mistake some honest Amazon links for affiliate links.
These special keywords are available:
js:true
js:false
special:cookies
special:affiliate
special:media
special:tracking
You can of course also exclude them:
"keyboard -special:tracking -special:affiliate".
- Added outgoing links as search terms. Up to 25 per page. Great for ego-searching.
Example:
"links:archive.org"
will list pages that link to archive.org. This is only available at the highest level of a domain: you can't, for example, search for links to "search.marginalia.nu", only "marginalia.nu".
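To illustrate the query syntax above, here is a minimal sketch of how terms with a leading minus could be separated from the rest and matched against a page's keywords. The function names parse_query and matches are hypothetical, and this is not the search engine's actual code.

```python
def parse_query(query):
    """Split a query into terms to require and terms to exclude."""
    include, exclude = [], []
    for term in query.split():
        # A leading minus marks a term that must NOT be present.
        if term.startswith("-") and len(term) > 1:
            exclude.append(term[1:])
        else:
            include.append(term)
    return include, exclude


def matches(page_keywords, include, exclude):
    """True if the page has every included term and none of the excluded ones."""
    keywords = set(page_keywords)
    return all(t in keywords for t in include) and not any(
        t in keywords for t in exclude
    )


include, exclude = parse_query("keyboard -special:tracking -special:affiliate")
clean_page = {"keyboard", "js:false", "links:archive.org"}
tracked_page = {"keyboard", "js:true", "special:tracking"}
print(matches(clean_page, include, exclude))    # True
print(matches(tracked_page, include, exclude))  # False
```

Treating "special:" and "links:" terms as ordinary keywords keeps the sketch small; how the real index stores them is not described here.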
Exploration Mode (Experimental)
If you press the little "🔀" icon next to a search result, you will be brought to a list of domains that might be similar. From there you can keep pressing "🔀" again to explore the web.
This is perhaps best used for navigating the blogosphere, neocities, and similar digital communities.
This is an experimental feature and the user interface is really rough, but it's a lot of fun so that's why I've made it accessible to the public.
I particularly enjoyed this rabbit hole.
2021 October Revamp
- Introduced a ranking algorithm that takes into consideration both the average quality of the domain, and the number of links to the domain (and their quality). This should mean fewer garbage results and less SEO spam.
- Added ANOTHER ranking algorithm along with the first one, a modified PageRank that aggressively biases toward personal websites.
- Drastically improved keyword extraction and topic identification.
- Support for many new types of keywords, including: C#, .308, 5.56mm, comp.lang.c, #hashtag, 90210.
- Added the ability to filter on page properties like javascript and declared HTML standard (based on DTD first and guesswork as a fallback).
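One common way to bias PageRank toward a chosen set of sites is to restrict the random "teleport" step to those sites, often called personalized PageRank. Whether the modified PageRank mentioned above works this way is not stated, so the sketch below is only an assumption about the general technique. The function name and the example domains are made up.

```python
def personalized_pagerank(links, seeds, damping=0.85, iterations=50):
    """Power iteration where random jumps land only on the seed domains.

    links maps a domain to the list of domains it links to.
    seeds is the (non-empty) set of domains to bias toward.
    """
    nodes = set(links) | {t for ts in links.values() for t in ts} | set(seeds)
    teleport = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(teleport)
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * teleport[n] for n in nodes}
        for src in nodes:
            targets = links.get(src, [])
            if targets:
                share = damping * rank[src] / len(targets)
                for t in targets:
                    nxt[t] += share
            else:
                # Dangling domain: hand its rank back to the seeds.
                for n in nodes:
                    nxt[n] += damping * rank[src] * teleport[n]
        rank = nxt
    return rank


links = {
    "blog.example": ["friend.example"],
    "friend.example": ["blog.example"],
    "corp.example": ["shop.example"],
    "shop.example": ["corp.example"],
}
rank = personalized_pagerank(links, seeds={"blog.example"})
print(rank["friend.example"] > rank["corp.example"])  # True
```

Domains that are not reachable from the seeds end up with no rank at all in this sketch, which is the aggressive end of the bias.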
Known Problems
- The minus keyword doesn't work super reliably.
- Keyword extraction may be a bit too conservative.
2021 September Bugfixes and Tweaks
- Reworded the error messages explaining that words can only exist within a Latin-1 encoding. Also added some automatic suggestions when there are few results, with a link to a tips page.
- Fixed a bug where the indexes weren't queried in the right order, and good results would in some circumstances be overwritten with worse results.
- Fixed a bug where the same domain could appear too many times in the results.
- Search profiles have been added, and the default is a narrower configuration that's intended to reduce the noise of completely irrelevant search results. I'm not sure if this is necessary with the bug fixes above.
- Added support for curly quotes, as some operating systems apparently use those.
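Curly-quote support like the last item above can be approximated by mapping typographic quotes to their plain ASCII equivalents before the query is parsed. This is a minimal sketch under that assumption; the name normalize_quotes and the exact set of characters handled are hypothetical.

```python
# Typographic quotes mapped to their plain ASCII counterparts.
CURLY_TO_STRAIGHT = str.maketrans({
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u201e": '"',  # low double quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
})


def normalize_quotes(query):
    """Replace curly quotes so quoted phrases are recognised by the parser."""
    return query.translate(CURLY_TO_STRAIGHT)


print(normalize_quotes("\u201cpublic domain\u201d"))  # "public domain"
```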
2021 September Maintenance
- A full index rebuild. This is mainly to allow for a change in internal modelling that will fix some jankiness.
- It also allows for an improvement in index bucketing. This will hopefully improve the quality of the results.
- Topic extraction has been improved. Among the changes, the crawler will use word capitalization to pick up likely topics of a page.
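As a rough illustration of using capitalization as a topic signal, the sketch below counts capitalized words that do not start a sentence, since sentence-initial capitals say nothing about whether a word is a name. The function name, the regular expressions, and the scoring are all assumptions and not the crawler's actual logic.

```python
import re
from collections import Counter


def likely_topics(text, top=3):
    """Return the most frequent capitalized words that are not sentence-initial."""
    counts = Counter()
    for sentence in re.split(r"[.!?]\s+", text):
        words = re.findall(r"[A-Za-z][A-Za-z'-]*", sentence)
        # Skip the first word of each sentence: it is capitalized regardless.
        for word in words[1:]:
            if word[0].isupper():
                counts[word] += 1
    return [word for word, _ in counts.most_common(top)]


text = (
    "The crawler visited Stockholm. "
    "People in Stockholm like Gemini. "
    "Gemini is small."
)
print(likely_topics(text, top=2))  # ['Stockholm', 'Gemini']
```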
Further changes:
- Unsupported foreign languages are detected and filtered out more aggressively than before. For now the search engine targets: English, Latin and Swedish. Additional languages may come in the future, but I will probably need to recruit help, as I have no way of ensuring the quality of results I can't read.
- Even more aggressive link farm detection.
- Charset encoding defaults to ISO8859-1 in the absence of UTF-8 being requested. This prevents a lot of garbled descriptions.
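The charset fallback in the last item can be sketched as follows. This assumes the charset is read from the Content-Type header value only, and the helper name pick_charset is hypothetical. The real crawler may look in other places as well.

```python
def pick_charset(content_type):
    """Return "utf-8" only when it is explicitly declared, else "iso-8859-1"."""
    if content_type:
        # Parameters follow the media type, separated by semicolons.
        for param in content_type.split(";")[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "charset":
                if value.strip().strip('"').lower() in ("utf-8", "utf8"):
                    return "utf-8"
    return "iso-8859-1"


raw = b"caf\xe9"  # "café" encoded as ISO-8859-1
print(raw.decode(pick_charset("text/html")))  # café
```

Decoding as ISO-8859-1 never fails, since every byte maps to a character, which makes it a forgiving default.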
2021 August - Quality of Life updates
A lot of small features have been added to improve the usefulness of the search engine in finding information.
- Support for define:-queries that retrieve data from Wiktionary.
- Mathematical expression evaluations and unit conversions (a bit janky still).
- Spell checking for search terms that return no results. If "Farenheit" gives no results, you will be provided with the suggestion to try "Fahrenheit".
- The search engine will provide links to (hopefully) useful Wikipedia entries.
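A spelling suggestion like the "Farenheit" example above can be sketched with the fuzzy matcher in Python's standard library. The vocabulary list and the function name suggest are made up for illustration, and the search engine's own spell checker may work quite differently.

```python
import difflib


def suggest(term, vocabulary, cutoff=0.8):
    """Return the closest known word to term, or None if nothing is close."""
    found = difflib.get_close_matches(term.lower(), vocabulary, n=1, cutoff=cutoff)
    return found[0] if found else None


vocabulary = ["fahrenheit", "celsius", "kelvin"]
print(suggest("Farenheit", vocabulary))  # fahrenheit
```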
2021 July Index Rebuild
The index has been reconstructed (actually several times) to allow for new and exciting dimensions of search. What follows is a summary of some of the bigger feature changes.
- Search results are presented in an order that is likely more useful. Results that contain search terms will be boosted, and the number of links to a result will affect the order of presentation. Link counts are not part of the indexing and crawling considerations, so the same set of results will be presented as previously. This is not, and never will be, a popularity contest.
- Support for a wider dictionary of search terms, including words that include numbers, and sequences of up to four words. The search engine will automatically try pairs of words when searching, but additional words will be considered if they are placed within quotes.
- Resilience improvements! The index can recover from mild data corruption in a highly best-effort fashion, and the index will recover much faster if it needs to restart, from 30-60 minutes down to 5 minutes.
- Blacklisting of link- and content-farms is implemented even more aggressively than in previous versions. There are some areas where an especially heavy hand needed to be employed, including pages pertaining to cryptocurrencies and alarm-systems.
- Mobile support has been improved; the contents of the page will no longer overflow.
- Terminal-based browser support has been improved as well.
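The word sequences mentioned in the list above (pairs by default, up to four words) can be pictured as contiguous n-grams over the query words. The sketch below only generates the candidate sequences; the function name and the space-joined representation are assumptions.

```python
def word_ngrams(words, max_len=4):
    """Return every contiguous run of 1..max_len words, shortest runs first."""
    grams = []
    for n in range(1, max_len + 1):
        for start in range(len(words) - n + 1):
            grams.append(" ".join(words[start:start + n]))
    return grams


# Single words and pairs, as for an unquoted query.
print(word_ngrams(["steam", "engine", "repair"], max_len=2))
# ['steam', 'engine', 'repair', 'steam engine', 'engine repair']
```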
Reach me at kontakt@marginalia.nu