One of the biggest criticisms of Google and other big search engines is that they put too much emphasis on popularity. To call this searching by relevance is disingenuous, because popularity has hardly anything to do with how relevant a result is to the query, and it ends up rewarding corporations, trends, and ad-based sites. Algorithms based on popularity or link counts assume that people want the most popular or "authoritative" pages.
However, the biggest mistake is presuming that the most authoritative pages are the ones that are linked to the most, especially on the internet, where false information spreads like wildfire. This is a mistake the SALSA algorithm makes. The fact that the SALSA paper calls this ranking "authoritative" reveals how SALSA views link-based ranking: if there were no association between the ranking and authority, the authors would not have chosen that particular word to describe their ranking system.
Currently, 2 out of the 4 search engines for Gemini make similar assumptions to the above: TLGS (which uses SALSA) and Kennedy (which uses popularity-based ranking). Part of the reason is the poor results that come from searching only content and metadata; those problems are also outlined below. Ranking that takes into account the number of links and backlinks is one way to get around the problems of content-only ranking, particularly content farming.
While Google and Bing use these types of algorithms, they are also much worse than any search engine on Gemini, because they often prioritize their own content and ads, rank further based on SEO, and do a lot of tracking.
GUS and AuraGem Search (my own search engine) are both FTS-ranking search engines. However, GUS seems to have problems with searching based on relevance, which could be the result of how its search querying is implemented, or of what information it collects about each capsule or page.
The four Gemini search engines and their ranking methods:
* GUS (FTS ranking)
* TLGS (uses SALSA)
* Kennedy (uses Popularity-based algorithm)
* AuraGem Search (FTS ranking only)
Search engines need to face much more criticism than they do now, and Geminispace gives us the opportunity to do this at a time when it is not yet overrun by corporations and giant capsules that dominate everything. The goal of this article is to question existing ranking algorithms and detail the issues with them.
There are three older link-based ranking systems that are important for our discussion. HITS was the first of the three. It grouped pages into authorities and hubs: hubs are pages that contain many links, and authorities are pages that have many backlinks. The more backlinks a page has, the higher it ranks as an authority, and a high-ranking authority is the result of many links from high-ranking hubs.
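To make the hub/authority idea concrete, here is a minimal, illustrative sketch of the HITS iteration in Python. It is not the code of any real engine; the graph is just a dict mapping each page to the pages it links to.

```python
from collections import defaultdict

def hits(graph, iterations=50):
    """Toy HITS sketch: graph maps each page to the pages it links to."""
    pages = set(graph) | {p for links in graph.values() for p in links}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    # Build the reverse graph once so authority updates can sum over backlinks.
    backlinks = defaultdict(set)
    for page, links in graph.items():
        for target in links:
            backlinks[target].add(page)

    for _ in range(iterations):
        # A page's authority is the sum of the hub scores of pages linking to it.
        auth = {p: sum(hub[q] for q in backlinks[p]) for p in pages}
        # A page's hub score is the sum of the authority scores of pages it links to.
        hub = {p: sum(auth[q] for q in graph.get(p, ())) for p in pages}
        # Normalize so the scores stay bounded across iterations.
        a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth
```

Running this on a small graph where a handful of pages all link to each other shows how quickly a mutually reinforcing group floats to the top of the authority scores.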
However, this method produces an effect called the Tightly-Knit Community (TKC) effect: a group of pages that all link to each other reinforce one another's rankings. This is one of the effects abused by link farming, discussed below. The TKC effect often puts that community at the very top of the search results, taking away the chance for other pages or groups of pages to be listed at the top. This is discussed in the SALSA paper linked below.
SALSA is an improvement on HITS that makes the algorithm quicker to run and tries to better prevent the TKC effect. As shown in the SALSA paper, the TKC effect is reduced, so the top results are spread more evenly across the top communities of pages, as described in the quote below:
All 10 top authorities found by the Mutual Reinforcement approach are pro-life resources, while the top 10 SALSA authorities are split, with 6 pro-choice sites and 4 pro-life sites (which are the same top 4 pro-life sites found by the Mutual Reinforcement approach). Again, we see the TKC effect: The Mutual Reinforcement approach ranks highly authorities on only one aspect of the query, while SALSA blends authorities from both aspects into its principal community.
But this does not mean the TKC effect has been completely eradicated or solved. Two things should be noted about HITS and SALSA:
In other words, a good hub represents a page that pointed to many other pages, while a good authority represents a page that is linked by many different hubs.
- From gemini://gemi.dev/cgi-bin/wp.cgi/view?HITS+algorithm
On grounds of logic and theory, I reject the notion that this will provide decent search results. This system is very easy to abuse, and it ends up prioritizing big corporations, capitalistic sites, and trending pages over everything else.
It is also unable to distinguish the reasons behind linking to a page. It assumes that linking to a page signifies that the author viewed that page as a good authority, and therefore the ranking increases. But what if someone linked to a page to criticize it? What if someone linked to a page of misinformation to warn against it? The HITS and SALSA algorithms present popular and trending content as if it were authoritative, when they cannot know what is authoritative and what is not. At least PageRank is more honest about what it is really ranking by calling the rank "influence" or "importance".
There is also a circular problem with this system. New sites automatically get a very low ranking, which reduces their discoverability and consequently reduces their ability to attract links from other pages to boost their rankings. One's current ranking affects one's future ranking, and the only ways out of this cycle are for users to wade through all of the popular search results to reach the low-ranked ones, or for the authors of these new sites to contact people or post on forums so that others link to their site and raise its ranking. These authors, and the people who want to discover content, end up needing to completely side-step the search engine! Failing that, one ends up trying to game the system with SEO methods, because one has to.
However, this is not the only questionable part about SALSA.
It is important to keep in mind the main goal of broad-topic WWW searches, which is to enhance the precision at 10 of the results, not to rank the entire collection of sites correctly. It is entirely irrelevant if the site in place 98 is really better than the site in place 216. The Stochastic ranking, which turns out to be equivalent to a weighted in-degree ranking, discovers the most authoritative sites quite effectively (and very efficiently) in many (carefully assembled) collections. No claim is made on the quality of its ranking on the rest of the sites (which constitute the vast majority of the collection).
The authors of SALSA are very clear that they prioritize the top 10 results and their order, regardless of how well the rest of the results are ordered. This assumption prioritizes a specific type of searching that is not necessarily universal: searching to discover pages is different from searching to reach a specific page.
This also invites the argument that ranking systems really only matter for underspecified (broad) queries, so the emphasis on the problems with ranking algorithms is unwarranted. That argument hardly makes sense when the majority of searches people make are broad. I would also argue that broad searches are mostly used for *discovering* pages, not for getting to a specific page. Ranking based on popularity, however, prioritizes what it thinks people want, which suits specific lookups made with broad queries at the expense of discovery of broad topics. Serving broad discovery with broad-topic queries, and specific lookups with proper-noun or very specific queries, are both much better ways of handling searches without relying on popularity.
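The paper's remark that its stochastic ranking "turns out to be equivalent to a weighted in-degree ranking" can be illustrated with a small sketch. This is only the simplified in-degree reading; the actual SALSA algorithm also weights scores by connected component of the bipartite hub/authority graph, which this toy version ignores.

```python
from collections import defaultdict

def salsa_like_authority(graph):
    """Toy sketch of the 'weighted in-degree' reading of SALSA's authority
    ranking. graph maps each page to the pages it links to. The real
    algorithm also weights by connected component, which is ignored here."""
    indegree = defaultdict(int)
    total_links = 0
    for page, links in graph.items():
        for target in set(links):   # count each hub->authority edge once
            indegree[target] += 1
            total_links += 1
    total_links = total_links or 1
    # Normalize so scores sum to 1: every backlink is one "vote".
    return {p: d / total_links for p, d in indegree.items()}
```

Even in this stripped-down form, the ranking is driven purely by how many distinct pages link to you, which is exactly the assumption criticized above.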
PageRank is a similar link-based ranking system, developed by Larry Page and Sergey Brin for the Google Search Engine in the late 1990s. The patent expired in 2019. Google still uses PageRank but has since introduced other algorithm updates on top of it, including Google Panda.
PageRank makes very similar assumptions to HITS and SALSA:
A PageRank results from a mathematical algorithm based on the webgraph, created by all World Wide Web pages as nodes and hyperlinks as edges, taking into consideration authority hubs such as cnn.com or mayoclinic.org. The rank value indicates an importance of a particular page. A hyperlink to a page counts as a vote of support. The PageRank of a page is defined recursively and depends on the number and PageRank metric of all pages that link to it ("incoming links"). A page that is linked to by many pages with high PageRank receives a high rank itself.
- From gemini://gemi.dev/cgi-bin/wp.cgi/view?PageRank
The reason that PageRank is interesting is that there are many cases where simple citation counting does not correspond to our common sense notion of importance. For example, if a web page has a link off the Yahoo home page, it may be just one link but it is a very important one. This page should be ranked higher than many pages with more links but from obscure places. PageRank is an attempt to see how good an approximation to "importance" can be obtained just from the link structure.
[...] a page has high rank if the sum of the ranks of its backlinks is high. This covers both the case when a page has many backlinks and when a page has a few highly ranked backlinks
- From PageRank Citation Ranking Paper
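For reference, the recursion described in the quote can be sketched as a small power iteration. This is illustrative only, not Google's production ranking, and the damping value of 0.85 is just the commonly cited default from the paper.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Toy power-iteration sketch of PageRank. graph maps each page to the
    pages it links to; pages with no outgoing links are treated as dangling."""
    pages = set(graph) | {p for links in graph.values() for p in links}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            links = graph.get(page) or ()
            if links:
                # Each outgoing link passes on an equal share of the page's rank.
                share = damping * rank[page] / len(links)
                for target in links:
                    new_rank[target] += share
            else:
                # Dangling pages spread their rank evenly over every page.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank
```

A page linked once from a high-rank hub can outrank a page with many links from obscure places, which is precisely the "importance" behaviour the paper describes.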
The same problems and assumptions that apply to HITS and SALSA also apply to PageRank. High-authority hubs are often giant corporations (Yahoo in the 1990s, Google, Facebook, etc.) and highly linked or popular sites, but they can also include academic sites, government sites, etc. The problem is that highly linked sites become authorities that then increase the ranking of the pages they link to. This prioritizes corporations and ad-based companies that pay websites to link to them to increase their ranking, and then prioritizes whatever those high-ranking sites link to.
This is not a good system for discoverability and ends up making corporations and ad-based sites even more popular, further increasing their influence on search results. This system is very easy to abuse, especially by corporations and spammers, which leads us into the next topic.
PageRank Citation Ranking Paper
How Google Search Works - Archive.org
Because many search engines use link-based ranking, link farming became popular. It consists of creating pages full of links to boost your ranking in the search engine, or of posting links in guestbooks, comment sections, wikis, etc. The web was able to band-aid fix this with a "nofollow" attribute on links that tells search engines not to use the link for ranking. Gemini does not have this. What Gemini can do is strip link lines; if the search engine only follows link lines, it will then not see those links. However, this decreases usability in many cases (particularly wikis).
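As a rough illustration of the mechanics, a gemtext indexer can separate link lines (which start with "=>") from body text, so it can follow links for crawling without letting them count toward content matching. This is a sketch, not how any particular Gemini engine actually works.

```python
def split_gemtext(source):
    """Separate gemtext link lines from body text. Returns the raw link
    lines (for crawling) and the remaining text (for content indexing)."""
    links, body = [], []
    for line in source.splitlines():
        if line.startswith("=>"):
            # "=> URL optional label" -- keep the link for the crawl queue.
            links.append(line[2:].strip())
        else:
            body.append(line)
    return links, "\n".join(body)
```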
If a website author can post comments on other sites with links to their site, they are also able to post a link to those comments within their own site, effectively creating a TKC that reinforces and increases the ranking of those pages.
Many search engines try to combat this by detecting link farms, but that introduces the possibility of false positives.
Currently, as far as I am aware, TLGS and Kennedy are both prone to this. AuraGem and GUS are not.
Link-based ranking likely became popular due to the limitations of content-only ranking. For example, simple Full-Text Search systems don't necessarily take linguistic factors into account; they might simply split up words and search for exact matches within a page's contents. That ignores derived word forms, conjugated verbs, and declined nouns: different verb tenses, plural vs. singular nouns, noun cases (for inflected languages), etc. One way to fix this is to rewrite search queries to include all of the forms of a word so that each form is searched for in pages.
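A minimal sketch of that kind of query rewriting might look like the following. The "forms" table here is hand-made and hypothetical; a real engine would use a stemmer or a morphological dictionary instead.

```python
def expand_query(query, forms):
    """Toy query rewriting: expand each query term with its known word forms
    so the FTS layer also matches conjugations and plurals."""
    expanded = []
    for term in query.lower().split():
        expanded.append(term)
        expanded.extend(forms.get(term, ()))
    return expanded

# Hypothetical example data, purely for illustration:
forms = {"run": ["runs", "running", "ran"], "pub": ["pubs"]}
print(expand_query("run pub", forms))
# -> ['run', 'runs', 'running', 'ran', 'pub', 'pubs']
```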
An additional problem arises when grammatical words are searched for, meaning very common words like "is", "are", "the", etc. Typically, search engines strip or group these words. For example, if you search for "the pub", an engine that does no natural-language processing will try to match both words separately and bring back lots of results that merely use the word "the". A better engine might prioritize the word "pub", so at least matches for that word appear higher in the results. Others might search for the whole group "the pub": they can detect that "the" signals a noun phrase in English and so search for the whole phrase. But then they might not account for "a pub" or other variations, unless more rewriting or NLP happens.
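One simple way to handle grammatical words, sketched below under the assumption of a toy stop-word list, is to keep them but weight them far lower than content words.

```python
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to"}  # illustrative subset

def weight_terms(query):
    """Toy sketch: keep grammatical words but give them a much lower weight,
    so 'the pub' is dominated by matches on 'pub' rather than 'the'."""
    return {term: 0.1 if term in STOP_WORDS else 1.0
            for term in query.lower().split()}

print(weight_terms("the pub"))   # {'the': 0.1, 'pub': 1.0}
```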
One method of reducing over-matching of content is stripping parts of a page that are detected to be irrelevant or not useful for searching. This is often used for the web, but less useful for Gemini, aside from stripping preformatted sections, text art, and section dividers.
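A pre-indexing filter along those lines might look roughly like this for gemtext, dropping preformatted blocks and divider-looking lines before full-text indexing. The divider pattern is a crude guess, not a standard.

```python
import re

DIVIDER = re.compile(r"^[-=~*_ ]{4,}$")   # crude section-divider pattern

def strip_noise(gemtext):
    """Toy pre-indexing filter for gemtext: drop preformatted blocks
    (``` fences) and lines that look like section dividers, keeping
    ordinary text for the full-text index."""
    kept, in_pre = [], False
    for line in gemtext.splitlines():
        if line.startswith("```"):
            in_pre = not in_pre       # toggle on opening/closing fence
            continue
        if in_pre or DIVIDER.match(line):
            continue
        kept.append(line)
    return "\n".join(kept)
```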
Content-based ranking might also rank based on file metadata. AuraGem uses this for audio files by searching the file tags on mp3 and ogg files. However, there are also problems where search results include too many links that match just one word from the query. For example, searching for Station could return tons of links that use "Station" in their title, while Station's actual site might be demoted.
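One hedge against that kind of over-matching is to score whole-title matches far above single-word matches. The numbers below are arbitrary placeholders, purely for illustration; this is not AuraGem's actual scoring.

```python
def title_score(query, title):
    """Toy scoring sketch: a whole-title match scores much higher than a
    single shared word, so a search for 'Station' can surface the capsule
    actually named Station above pages that merely mention the word."""
    q = query.lower().strip()
    t = title.lower().strip()
    if q == t:
        return 3.0            # exact title match
    if q in t.split():
        return 1.0            # one query word appears in the title
    return 0.0

print(title_score("station", "Station"))            # 3.0
print(title_score("station", "My Station Logs"))    # 1.0
```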
This system can still be abused, however...
Because many search engines either cache page contents and do full-text searching on them, or use keyword extraction, content farming has become common. It consists of stuffing your content with words whose only purpose is to boost the ranking of your page. If a search engine counts how many times a keyword is used on a page, as very old search engines did prior to 2000, then one could just repeat a keyword over and over. A search engine could instead cap the number of keywords or the count of a keyword, or detect back-to-back lists of keywords or tags.
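Both of those counter-measures can be sketched very simply. The cap and threshold values here are arbitrary, and a real engine would combine many more signals.

```python
from collections import Counter

def capped_term_frequencies(text, cap=5):
    """Toy anti-keyword-stuffing sketch: count terms for ranking, but cap how
    much any single term can contribute, so repeating a keyword fifty times
    is no better than repeating it a handful of times."""
    counts = Counter(text.lower().split())
    return {term: min(count, cap) for term, count in counts.items()}

def looks_like_keyword_list(line, threshold=8):
    """Crude heuristic: a long run of comma-separated one- or two-word items
    is more likely a tag/keyword list than prose."""
    parts = [p.strip() for p in line.split(",")]
    return len(parts) >= threshold and all(len(p.split()) <= 2 for p in parts)
```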
Nowadays, content farming largely takes the form of copying and/or rewriting another popular article so that the search engine will boost the new, plagiarized article, especially when people use words from the popular article to search for more information on the same topic. This system rewards unoriginal and repeated content, which is why we get so much superficial and unoriginal content on news websites.
Content Farm Gemipedia Article
Google tried to fix these problems by releasing an update to their search engine called Google Panda.
Google Panda Gemipedia Article
Natural-language processing is a topic being researched much more now. However, the focus seems to be on using it as a fix on top of link-based or content-based ranking to detect spam, or on voice assistants, voice-to-text, and text-to-voice. What is not being researched as much is using NLP as the very base for ranking and indexing content.
Some problems with indexing content involve how indexers break up words or lexical units. Not all languages are as straightforward as English. Some languages are highly synthetic: polysynthetic languages can combine many units of meaning into one word, often forming a whole sentence with a single "word". One language on the more synthetic end of the inflected languages is German, which has high derivational synthesis (synthesis that derives words from other words). An example of a polysynthetic language is the Australian language Tiwi.
Synthetic Language on Gemipedia
Polysynthetic Language on Gemipedia
One idea that has always interested me is the ability to search languages based on linguistic attributes. I have a project called Orpheus that can store languages in a generalized AST that stores linguistic information (like phrases/clauses, modifiers, nouns, verbs, and all of their associated attributes, including verb voice, aspect, noun number, case, etc.). Searching based on linguistic information could prove much more powerful than current methods of query rewriting or only relying on NLP of queries.
While I have this project that can store language content in an AST, the next step would be parsing many languages into this AST. Then I would need to develop a means of searching this AST in a low-level fashion. Finally, a search engine might parse the search query, construct an AST search from it to search through already-parsed page contents, and give back pages with linguistic matches. This would already take care of word derivations and different inflectional forms of words, but it could be used for even more powerful searches.
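Orpheus's actual data model is not shown here, but a minimal, hypothetical illustration of a linguistic AST node and an attribute-based search over it might look like this. Every name and attribute in the sketch is invented purely for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """Minimal stand-in for a linguistic AST node: a unit of meaning with
    grammatical attributes and child phrases. Not Orpheus's actual model."""
    lemma: str                                    # dictionary form, e.g. "run" for "ran"
    role: str                                     # "verb", "noun", "clause", ...
    attrs: dict = field(default_factory=dict)     # e.g. {"tense": "past"}
    children: list = field(default_factory=list)

def find(node, lemma=None, role=None, **attrs):
    """Walk the tree and yield nodes matching a lemma, role, and attributes,
    so 'ran', 'runs', and 'running' would all match a query on lemma='run'."""
    ok = ((lemma is None or node.lemma == lemma)
          and (role is None or node.role == role)
          and all(node.attrs.get(k) == v for k, v in attrs.items()))
    if ok:
        yield node
    for child in node.children:
        yield from find(child, lemma=lemma, role=role, **attrs)

# Hypothetical parsed sentence: "She ran to the pub."
sentence = Node("run", "verb", {"tense": "past"},
                children=[Node("she", "noun", {"case": "nominative"}),
                          Node("pub", "noun", {"number": "singular"})])
print([n.role for n in find(sentence, lemma="run")])   # ['verb']
```

The point of the sketch is that a query expressed in terms of lemmas and attributes matches inflected forms for free, instead of relying on query rewriting after the fact.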
The very first step in solving a problem is recognizing that there is one; you cannot fix a problem you do not want to admit exists. The next step is not taking existing search-engine algorithms for granted as good. If the only thing you have to compare link-based ranking against is simple Full-Text Searching, then you get the false impression that link-based ranking is good, when really it is just the only thing you have experienced. You don't know better exists if you haven't seen it. That is why we should always strive for better. Sometimes this requires questioning already-existing practices. Newer programming languages did this. The pubnix and smolnet communities did this. Science does this. And Gemini did this too.
If users desire SALSA- or popularity-ranked search results, then I direct them to TLGS and Kennedy, respectively, which use those two algorithms well. They are already great search engines providing that functionality, so I see no reason for AuraGem Search to replicate them. Instead, I intend AuraGem Search to experiment with other methods of ranking, especially NLP.
One thing I think is useful is questioning the assumption that what the user wants *must* be among the first 5-10 search results. Content-based ranking becomes much more manageable if we allow more room. This is also why AuraGem lists search results in a very condensed format: to get as many results on the screen at once as possible.
AuraGem currently doesn't solve any of the content farming issues. However, I will be slowly making changes to how searching works over time. New devlogs will be made detailing these changes and why they were made.