Change Log

Detailed changelog available here:

https://git.marginalia.nu/marginalia/marginalia.nu/graph?branch=refs%2Fheads%2Frelease

2022 August:

Recipe filter

Ad detection

Query time optimization

2022 June-July:

Overhaul of the crawler and database model, index and database reconstructed.

2022 May Changes

Project goes Open Source

https://git.marginalia.nu/marginalia/marginalia.nu

Added support for a few !bangs, currently !g and !ddg
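
A minimal sketch of how a bang redirect could work. This is not the engine's actual code: the BANGS table, the resolve_bang name and the target URL templates are assumptions made for illustration.

```python
from urllib.parse import quote_plus

# Hypothetical bang table; the URL templates are assumptions.
BANGS = {
    "!g": "https://www.google.com/search?q={}",
    "!ddg": "https://duckduckgo.com/?q={}",
}


def resolve_bang(query):
    """Return a redirect URL if the query contains a known bang, else None."""
    terms = query.split()
    for i, term in enumerate(terms):
        template = BANGS.get(term)
        if template is not None:
            # Everything except the bang itself is forwarded as the query.
            rest = " ".join(terms[:i] + terms[i + 1:])
            return template.format(quote_plus(rest))
    return None
```

A query without a known bang returns None, so the caller can fall through to the normal search path.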

2022 April Changes

Added type-ahead suggestions for desktop.

New index backend based on a B-tree variant.

Reworked the crawler to be more compatible with the WARC format.

2022 March Changes

Side-loaded all of StackExchange and StackOverflow.

Improved the blogocentric algorithm to prioritize smaller sites more effectively.

Removed some Mastodon instances from random mode, as they aren't very interesting to visit. You just get a log-in screen.

Optimized exploration mode as it was getting quite sluggish.

Added a drilldown link on the search results for narrowing the search to the same domain.

Reduced the number of Mastodon instances that crop up in Random Exploration mode. I like the idea of these sites, but there are so many of them and they only show you a sign-up screen when you visit them.

2022 February Changes

Slightly relaxed the hard limit on how much JavaScript is allowed on a page. Better heuristics have been found, and this limit throws out a lot of babies with the bathwater.

Work has been almost at a standstill due to some health issues. I hope to get more productive again soon.

2022 January Changes

Fixed a minor bug that broke, among other things, the site: search.

Overhaul of the web design for the search engine.

The random feature now shows site screenshots to offer a "flavor" of each site. Site-info is much improved as well.

API access

https://api.marginalia.nu/

2021 December Changes

Crawling is smarter and uses the ranking algorithm for prioritizing the order of the results.

Search results are better sorted in terms of how important the search terms are in relation to the query.

The query parser is a lot smarter and generates better alternative search terms to supplement the main query (pluralization, concatenation), guided by a term frequency dictionary.
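
A rough sketch of dictionary-guided alternative terms, under stated assumptions: the alternative_terms function, the min_freq threshold and the naive "add or strip a trailing s" pluralization are all hypothetical, and the real query parser is certainly more careful.

```python
def alternative_terms(words, term_freq, min_freq=1):
    """Propose plural/singular and concatenated variants that the
    term frequency dictionary has actually seen."""
    candidates = []
    for w in words:
        # Naive English pluralization: toggle a trailing "s".
        candidates.append(w[:-1] if w.endswith("s") else w + "s")
    for a, b in zip(words, words[1:]):
        # Concatenate adjacent words: "type writer" -> "typewriter".
        candidates.append(a + b)

    seen = set(words)
    result = []
    for c in candidates:
        # Keep only unseen variants that are frequent enough.
        if c not in seen and term_freq.get(c, 0) >= min_freq:
            seen.add(c)
            result.append(c)
    return result
```

The dictionary lookup is what keeps this from flooding the query with nonsense: a variant that never occurs in the corpus is simply dropped.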

Additional keywords are extracted for each document. This will add more junk results at the bottom of the page, but hopefully more good matches too.

The maximum query length has been restricted.

Additional Technical Details

2021 November Update

Further refinements to keyword extraction. The technically minded can read a few words about that here:

/log/37-keyword-extraction.gmi

Improved crawling logic to offer more leniency toward sites that have high ranking. This improves the chance of pushing through local minima and discovering additional quality content on those sites.

Mended some fences with a few of the websites that blocked my crawler when it was young and unruly, and removed a few sites from the blocklist that didn't belong there. More quality websites in the index!

As an experiment, tagged websites that contain links to Amazon, attempt to place cookies on the crawler, contain links to known trackers, contain audio/video tags, or contain JavaScript.

It's not perfect. It will miss some trackers, and it will mistake some honest Amazon links for affiliate links.

These special keywords are available:

js:true

js:false

special:cookies

special:affiliate

special:media

special:tracking

You can of course also exclude them, for example:

"keyboard -special:tracking -special:affiliate"

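How the exclusion syntax above might be interpreted, as a minimal sketch. The split_query and matches helpers are hypothetical, and a document is modelled as a plain set of keywords, which is an assumption about the index rather than a description of it.

```python
def split_query(query):
    """Split a query into required terms and excluded ("-term") terms."""
    include, exclude = [], []
    for term in query.split():
        if term.startswith("-") and len(term) > 1:
            exclude.append(term[1:])
        else:
            include.append(term)
    return include, exclude


def matches(doc_keywords, query):
    """True if the document has every required term and no excluded one."""
    include, exclude = split_query(query)
    has_all = all(t in doc_keywords for t in include)
    has_excluded = any(t in doc_keywords for t in exclude)
    return has_all and not has_excluded
```

With this model the special keywords need no special handling: they are ordinary terms that happen to be attached to a page by the crawler.
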
Added outgoing links as search terms. Up to 25 per page. Great for ego-searching.

Example:

"links:archive.org"

will list pages that link to archive.org. This is only available for the top level of a domain. You can't, for example, search for "search.marginalia.nu", only "marginalia.nu".

Exploration Mode (Experimental)

If you press the little "🔀" icon next to a search result, you will be brought to a list of domains that might be similar. From there you can keep pressing "🔀" again to explore the web.

This is perhaps best used for navigating the blogosphere, neocities, and similar digital communities.

This is an experimental feature and the user interface is really rough, but it's a lot of fun so that's why I've made it accessible to the public.

I particularly enjoyed this rabbit hole.

2021 October Revamp

Introduced a ranking algorithm that takes into consideration both the average quality of the domain, and the number of links to the domain (and their quality). This should mean fewer garbage results and less SEO spam.

Added ANOTHER ranking algorithm along with the first one, a modified PageRank that aggressively biases toward personal websites.
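
One common way to bias PageRank toward a chosen set of sites is to restrict the teleport step to that set (often called personalized PageRank). The sketch below shows that general technique only. Whether the engine's modification works this way is an assumption, and biased_pagerank, its parameters and the link-graph format are hypothetical.

```python
def biased_pagerank(links, preferred, damping=0.85, iterations=50):
    """Power-iteration PageRank whose teleport step only lands on
    `preferred` nodes. `links` maps a node to the nodes it links to."""
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)

    seeds = [n for n in nodes if n in preferred]
    if not seeds:
        raise ValueError("preferred must contain at least one known node")

    # Teleport distribution: uniform over the preferred nodes only.
    teleport = {n: 0.0 for n in nodes}
    for n in seeds:
        teleport[n] = 1.0 / len(seeds)

    rank = dict(teleport)
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * teleport[n] for n in nodes}
        for n in nodes:
            targets = links.get(n, [])
            if targets:
                share = damping * rank[n] / len(targets)
                for t in targets:
                    nxt[t] += share
            else:
                # Dangling node: hand its rank back to the preferred set.
                for s in seeds:
                    nxt[s] += damping * rank[n] * teleport[s]
        rank = nxt
    return rank
```

Nodes that cannot be reached from the preferred set end up with a rank of zero, which is what makes the bias aggressive.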

Drastically improved keyword extraction and topic identification.

Support for many new types of keywords, including: C#, .308, 5.56mm, comp.lang.c, #hashtag, 90210.
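
To illustrate why these keywords are awkward, here is a small tokenizer sketch that keeps them intact instead of splitting on punctuation. The regular expression and the tokenize name are assumptions for illustration, not the engine's rules.

```python
import re

# Optional leading "#" or ".", an alphanumeric run, optional dotted
# continuations, and optional trailing "#" or "+" (as in C# or C++).
TOKEN = re.compile(r"[#.]?[A-Za-z0-9]+(?:\.[A-Za-z0-9]+)*[#+]*")


def tokenize(text):
    """Extract keyword-like tokens, preserving inner dots and sigils."""
    return TOKEN.findall(text)
```

A trailing sentence period is not swallowed, because a dot only counts as part of a token when alphanumerics follow it.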

Added the ability to filter on page properties like javascript and declared HTML standard (based on DTD first and guesswork as a fallback).

Known Problems

The minus keyword doesn't work super reliably.

Keyword extraction may be a bit too conservative.

2021 September Bugfixes and Tweaks

Reworded some error messages to explain that search terms must be representable in Latin-1 encoding. Also added some automatic suggestions when there are few results, with a link to a tips page.

Fixed a bug where the indexes weren't queried in the right order, and good results would in some circumstances be overwritten with worse results.

Fixed a bug where the same domain could appear too many times in the results.

Search profiles have been added, and the default is a more narrow configuration that's intended to reduce the noise of completely irrelevant search results. I'm not sure if this is necessary with the bug fixes above.

Added support for curly quotes, as some operating systems apparently use those.
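
Curly-quote support can be as simple as normalizing the query before parsing it. A minimal sketch follows, in which the normalize_quotes name and the exact set of characters handled are assumptions.

```python
# Map common typographic quotes to their ASCII equivalents.
CURLY_TO_STRAIGHT = str.maketrans({
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u201e": '"',  # double low-9 quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
})


def normalize_quotes(query):
    """Replace curly quotes so phrase parsing only sees straight quotes."""
    return query.translate(CURLY_TO_STRAIGHT)
```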

2021 September Maintenance

A full index rebuild. This is mainly to allow for a change in internal modelling that will fix some jankiness.

It also allows for an improvement in index bucketing. This will hopefully improve the quality of the results.

Topic extraction has been improved. Among the changes, the crawler will use word capitalization to pick up likely topics of a page.

Further changes:

Unsupported foreign languages are detected and filtered out more aggressively than before. For now the search engine targets English, Latin and Swedish. Additional languages may come in the future, but I will probably need to recruit help, as I have no way of ensuring the quality of results I can't read.

Even more aggressive link farm detection.

Charset encoding defaults to ISO-8859-1 in the absence of UTF-8 being requested. This prevents a lot of garbled descriptions.
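
The charset fallback can be sketched roughly like this. The decode_body helper is hypothetical, and checking the Content-Type header for "utf-8" is an assumption about where the charset is read from.

```python
def decode_body(body, content_type=""):
    """Decode a response body, falling back to ISO-8859-1 unless
    UTF-8 is explicitly declared."""
    if "utf-8" in content_type.lower():
        # Replace undecodable bytes rather than failing the document.
        return body.decode("utf-8", errors="replace")
    # Every byte value is valid ISO-8859-1, so this never raises.
    return body.decode("iso-8859-1")
```

The fallback is safe in the sense that it cannot fail, although a UTF-8 page that does not declare its charset will still come out garbled.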

2021 August - Quality of Life updates

A lot of small features have been added to improve the usefulness of the search engine in finding information.

Support for define: queries that retrieve data from Wiktionary.

Mathematical expression evaluations and unit conversions (a bit janky still).

Spell checking for search terms that return no results. If "Farenheit" gives no results, you will be provided with the suggestion to try "Fahrenheit".

The search engine will provide links to (hopefully) useful Wikipedia entries.
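
The spell-checking suggestion mentioned above could be approximated with Python's standard difflib module. This is only a sketch: the suggest function, the cutoff value and the word list are assumptions, and the engine's own method is not described here.

```python
import difflib


def suggest(term, dictionary, max_suggestions=1):
    """Suggest close dictionary words for a term that is not in it."""
    if term in dictionary:
        return []
    # cutoff=0.8 keeps only near misses, not loosely similar words.
    return difflib.get_close_matches(
        term, dictionary, n=max_suggestions, cutoff=0.8
    )
```

In a search engine the dictionary would presumably be the set of indexed terms, so suggestions are always words that return results.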

2021 July Index Rebuild

The index has been reconstructed (actually several times) to allow for new and exciting dimensions of search. What follows is a summary of some of the bigger feature changes.

Search results are presented in an order that is likely more useful. Results that contain search terms will be boosted, and the number of links to the results will affect the order of presentation. Link counts are not part of the indexing and crawling considerations, so the same set of results will be presented as before. This is not, and never will be, a popularity contest.

Support for a wider dictionary of search terms, including words that include numbers, and sequences of up to four words. The search engine will automatically try pairs of words when searching, but additional words will be considered if they are placed within quotes.

Resilience improvements! The index can recover from mild data corruption in a highly best-effort fashion, and the index will recover much faster if it needs to restart, from 30-60 minutes down to 5 minutes.

Blacklisting of link- and content-farms is implemented even more aggressively than in previous versions. There are some areas where an especially heavy hand needed to be employed, including pages pertaining to cryptocurrencies and alarm-systems.

Mobile support has been improved. The contents of the page will no longer overflow.

Terminal based browser support has been improved as well.
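
The word sequences of up to four words mentioned in this list can be pictured as word n-grams. A minimal sketch, where the word_ngrams name and the output format are assumptions:

```python
def word_ngrams(words, max_len=4):
    """Return every run of 1..max_len consecutive words, shortest first."""
    grams = []
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            grams.append(" ".join(words[i:i + n]))
    return grams
```

The number of sequences grows quickly with document length, which is one plausible reason to cap the sequence length at a small number.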

Reach me at kontakt@marginalia.nu