💾 Archived View for station.martinrue.com › krixano › 9e47797414d94c3bbaa65adacb117e07 captured on 2024-02-05 at 11:50:56. Gemini links have been rewritten to link to archived content
-=-=-=-=-=-=-
I have just made some edits to my latest devlog post. I wanted to make it very clear that this is a *descriptive* study of how well Search Engines match user expectations. It does not focus on whether this is viable, or how to achieve such accuracy. Thanks to all who read it, and hopefully these edits make the purpose of the article clearer :)
gemini://auragem.space/devlog/20220807.gmi
2 years ago · 👍 freezr
@freezr Another thing that helps Search Engine developers is if you tell them a *specific query* you used that you are getting bad results from, and what type of results you would expect. This way we have something tangible that we can check algorithm tweaks with and against. I have a search engine feedback page that one can use for this: gemini://auragem.space/search/feedback.gmi · 2 years ago
@freezr Old articles tend to be prioritized because they have had more time to be linked from other pages. This is one of the problems with HITS, SALSA, and PageRank that I discussed in my article "Search Engine Ranking Systems are Being Left Unquestioned" :D
Currently, AuraGem only uses Full-Text Searching and doesn't do anything with links, so this is not as big of a problem there. However, a lot of results end up with the exact same ranking, and those ties then get ordered by when the pages were indexed, with the oldest first. I was considering changing this, but I need to make sure it is balanced. · 2 years ago
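As a rough illustration of the tie-breaking behaviour described here, the following Python sketch orders results by score and falls back to index time only when scores are equal. The `Result` type, its fields, and `rank_results` are hypothetical names made up for this example; this is not AuraGem's actual code.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Result:
    url: str
    score: float          # full-text relevance score (higher is better)
    indexed_at: datetime  # when the crawler indexed the page


def rank_results(results, newest_first=True):
    """Order by score; break ties between equal scores by index time."""
    def key(r):
        ts = r.indexed_at.timestamp()
        # Negate the timestamp to put recently indexed pages first.
        return (-r.score, -ts if newest_first else ts)
    return sorted(results, key=key)


# Hypothetical sample data: a.gmi and b.gmi tie on score.
results = [
    Result("gemini://example.org/a.gmi", 1.0, datetime(2020, 1, 1)),
    Result("gemini://example.org/b.gmi", 1.0, datetime(2022, 8, 7)),
    Result("gemini://example.org/c.gmi", 2.0, datetime(2021, 5, 1)),
]
ranked_urls = [r.url for r in rank_results(results)]
```

With `newest_first=False` the tied pages come back in the oldest-first order the comment describes, so the flag makes it easy to compare both behaviours on the same query.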
@krixano these are my general thoughts...
Most of the time you look for something because you have a need NOW, and you expect to find info from your own timeframe; instead, most of the time I find information so old that it is 95% useless. No matter how many times I add the year to get recent results, this hint is always ignored.
If I am looking to fix my old stuff, it does make sense to get old articles; but if I'm looking to fix a Linux issue, for instance, it doesn't make sense that in 2022 you list links from 2012 first.
This really makes me think these search engines aren't designed to give you truly useful results. · 2 years ago
@freezr Re: Chronological fashion
Interesting... more details please! :D
Do you always prefer chronological? Do you ever try to search old articles? Do you ever prioritize anything else above chronological? How would you want it ordered - oldest to newest (oldest first), or newest to oldest (newest first)? Would you want this done by default, or are you fine with adding a keyword to your search query to get this type of ordering?
Since AuraGem can parse publication dates, it could do this, but it needs to be balanced with what other users might want. I would also need to handle pages that don't have a publication date. · 2 years ago
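One way the opt-in keyword and the missing-date case could fit together is sketched below in Python. The `order:newest` / `order:oldest` keywords and both function names are assumptions invented for this example, not an existing AuraGem feature; undated pages are simply placed after dated ones, which is only one of several possible policies.

```python
from datetime import date


def split_order_keyword(query):
    """Strip a hypothetical 'order:newest' / 'order:oldest' keyword from a query."""
    order = None
    terms = []
    for term in query.split():
        if term in ("order:newest", "order:oldest"):
            order = term.split(":", 1)[1]
        else:
            terms.append(term)
    return " ".join(terms), order


def order_by_date(pages, order):
    """Sort (url, published) pairs, where published is a date or None.

    With no order keyword the incoming (relevance) order is kept.
    Undated pages keep their relative order and always go last.
    """
    if order is None:
        return list(pages)
    dated = [p for p in pages if p[1] is not None]
    undated = [p for p in pages if p[1] is None]
    dated.sort(key=lambda p: p[1], reverse=(order == "newest"))
    return dated + undated
```

For example, `split_order_keyword("linux audio order:newest")` yields the cleaned query `"linux audio"` plus the order `"newest"`, which can then be passed to `order_by_date`.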
@krixano @moddedbear I see the risk in metadata being abused... 🤔
I personally prefer search engines that show results in a chronological fashion.
No matter what, anything I am searching today has very few aspects in common with the same thing ten years back... 🤔 · 2 years ago
@moddedbear You explained that much better and more succinctly than I ever could, lol. · 2 years ago
@freezr Speaking as someone who knows only very little about building a search engine here, but a good one should be able to parse all the important information out of a page and an entire capsule without any extra metadata.
Metadata wouldn't be all that important anyway since there are plenty of other tricks to get that info, but they involve some guesswork. Words in headers are probably most important. The first header on a page can usually be regarded as the page title, and if that page is a capsule root then it can probably be interpreted as the capsule name. Words found on the root like "tech" or "poetry" likely describe all other pages. Word frequency... etc. · 2 years ago
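The header and word-frequency heuristics above could be sketched roughly like this in Python. The function name and the `HEADER_WEIGHT` value are assumptions chosen for illustration; a real indexer would need tuning, stop-word handling, and more.

```python
import re
from collections import Counter

HEADER_WEIGHT = 3  # assumed boost for words in headers; would need tuning


def analyze_gemtext(text):
    """Return (title, weighted term counts) for a gemtext page."""
    title = None
    weights = Counter()
    preformatted = False
    for line in text.splitlines():
        if line.startswith("```"):
            # Toggle lines open and close preformatted blocks; skip their content.
            preformatted = not preformatted
            continue
        if preformatted:
            continue
        is_header = line.startswith("#")
        if is_header:
            line = line.lstrip("#").strip()
            if title is None:
                title = line  # the first header is treated as the page title
        elif line.startswith("=>"):
            # Link line: keep only the human-readable label, not the URL.
            parts = line[2:].split(None, 1)
            line = parts[1] if len(parts) > 1 else ""
        for word in re.findall(r"[a-z0-9']+", line.lower()):
            weights[word] += HEADER_WEIGHT if is_header else 1
    return title, weights
```

Running this on a capsule's root page and reusing the top-weighted words as hints for its other pages would be one way to apply the "tech" or "poetry" idea, though that step is not shown here.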
@freezr Finally, your idea about classifying capsules within a search.txt file is the very thing I was thinking about when I posted the somewhat cryptic message about comparing how books are categorized to how capsules are categorized :D
The thing is, a lot of people won't end up using this, particularly because it takes a lot of work. And if we are to deal with classifying capsules, then we also need to take into account multi-functional capsules (capsules that go into multiple categories), as well as misclassifications and other abuses of the system. Search Engines need to work in the general case first, and then you can start adding more helpful things on top of that. · 2 years ago
@freezr Also, nothing that I am writing about involves machine learning at all, btw. NLP does not *have* to be machine learning, and that was one of the criticisms I had of @haze's gemlog post about Search Engine bias vs. accuracy. · 2 years ago
@freezr Well, there's already something kinda like this that technically all the search engines have, called hashtags, or tags (or sometimes keywords). But the problem is what if someone just spams a bunch of irrelevant tags, or categories in your example, just to boost their search results. This is what the content farming that I discussed in my previous devlog was about. · 2 years ago
Doing that, we would not have extended or changed anything in the protocol.
This is a deal that capsules make with search engines to be better indexed and to allow the latter to offer faster and more accurate results. 🤔 · 2 years ago
I knew that I was wrong about robots.txt!!! 🤣 But this is good!
I was wondering why, rather than using **machine learning**, we can't use **user learning**?
On the capsule's root anyone can put a plain text file, for instance: search.txt. This would tell search engines which categories the capsule would like to be associated with and which it would not; positive & negative keywords, etc...
I'd say capsule owners would include from a min. of 3 up to a max. of 5 entries for each section (not too many, though) to feed the search engines and improve their ability to infer and offer better results. · 2 years ago
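To make the search.txt idea a bit more concrete, here is a minimal Python sketch of a parser for it. The file layout (`include:` and `exclude:` lines with comma-separated entries) is purely an assumption for this example, since the proposal does not define a syntax; only the 3-to-5 entries per section limit comes from the comment itself.

```python
MIN_ENTRIES, MAX_ENTRIES = 3, 5     # limits suggested in the proposal
SECTIONS = ("include", "exclude")   # hypothetical section names


def parse_search_txt(text):
    """Parse a hypothetical search.txt into {'include': [...], 'exclude': [...]}."""
    sections = {name: [] for name in SECTIONS}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, sep, value = line.partition(":")
        name = name.strip().lower()
        if not sep or name not in sections:
            raise ValueError(f"unrecognized line: {raw!r}")
        for entry in value.split(","):
            entry = entry.strip().lower()
            if entry and entry not in sections[name]:
                sections[name].append(entry)
    for name, entries in sections.items():
        if not MIN_ENTRIES <= len(entries) <= MAX_ENTRIES:
            raise ValueError(
                f"{name}: expected {MIN_ENTRIES}-{MAX_ENTRIES} entries, "
                f"got {len(entries)}"
            )
    return sections
```

Rejecting files outside the 3-to-5 range is one simple way to limit keyword spamming, although it does nothing about entries that are within the limit but irrelevant.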
@freezr Thanks for reading :)
Yeah, that's definitely something that needs more consideration. Sorry I didn't answer it before in your other comment; I need to think it over more. However, I wanted to mention that afaik, robots.txt is mainly for telling crawlers a crawl delay (how frequently a crawler can crawl your site/capsule) and which things can be crawled. It's not a thing that provides metadata for files. However, there are sitemaps that can be used to provide extra metadata. But the problem is not everyone might use this, so search engines also need to work well in the case where there's little of this type of metadata or semantic markup. · 2 years ago
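For anyone curious what reading those two things out of a robots.txt looks like, here is a small sketch using Python's standard `urllib.robotparser`. The sample file contents, the `indexer` user agent name, and the paths are made up for illustration, and `Crawl-delay` is a widely used extension rather than something every crawler honors.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents for a capsule.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /private/
"""


def load_rules(robots_txt):
    """Parse robots.txt text that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)
delay = rules.crawl_delay("indexer")  # seconds between requests, or None
may_fetch_private = rules.can_fetch("indexer", "/private/notes.gmi")
may_fetch_gemlog = rules.can_fetch("indexer", "/gemlog/post.gmi")
```

Note that nothing in this file describes what the pages are about; it only says how often and where a crawler may go, which matches the point made above.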
Yes, the new introduction provides a better understanding now. 👍
However, I have difficulty figuring out how, without feeding the crawlers beforehand, they can look for categories. Simple curiosity... 🤔 · 2 years ago