@acidus Just wanted to say, your Kennedy search engine is really cool! I started my own search engine a year ago, but I've been completely failing at keeping it running and maintaining it. It stalled early on because I was getting frustrated with the querying being too slow. I'm just now starting to improve it. I really like that you added Gemipedia search directly into the search results - that's a great feature.
It seems like my search engine serves a different purpose from yours, so maybe there's room for two different types of search engines :D
2 years ago · 👍 mozz_iphone
@acidus Btw, it's fine. I could very well have misunderstood you. But...
link text often is a better descriptor of a page than analyzing the content of the page or its metadata.
I think you underestimate the information that you can gather from just the content of a page. If you only have FTS and often-bad-or-incorrect metadata, then I could see how one would dismiss it easily. But, linguistics is one of my passions, and I've created NLP parsers (for English) before. In fact, one of my projects (called Orpheus) stores any natural language in an AST and can re-construct it back into the language (and can translate to other languages, sorta). · 2 years ago
@acidus All that I'm trying to get across is that if I reject an idea, it's because I have a legitimate reason for it. This doesn't mean I'm not looking at other ideas. I literally just looked at HITS, lmao.
There are many ways to do searching. Some searches prioritize what they call "authority". Some prioritize *new content* over "authority". Some prioritize popularity (which is not necessarily the same as authority, but can be). Looking at link statistics is a legitimate way of doing it, but it produces certain types of results that don't work for other types of results one might want. · 2 years ago
I’m not trying to dismiss or diminish anyone. I am sincerely sorry if that’s how I came across. it’s very late where I am and this is a crude way to share what I’m thinking · 2 years ago
@acidus Well, you shouldn't assume that I'm just dismissing work for no reason. You clearly don't respect the knowledge or reasoning that I have, or you wouldn't do this. Why? Because I'm just the person who makes simple proxies on Gemini? It seems like you're vastly underestimating my abilities. · 2 years ago
@acidus Well, I didn't dismiss work out of hand. I looked at HITS, but the number-one principle for my search engine is not taking into account statistics like the number of links to a page, because that has been done a million times by like every search engine ever.
As soon as I read this, it was interesting, but I knew that I wasn't going to use HITS, because it's something I instantly disagreed with.
In other words, a good hub represents a page that pointed to many other pages, while a good authority represents a page that is linked by many different hubs. · 2 years ago
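For reference, the hub/authority definition quoted above reduces to a short mutual-reinforcement loop. Here is a minimal sketch in Python, assuming the crawl graph is just a dict mapping each page URL to the URLs it links to (all names are illustrative, not anyone's actual code):
```
def hits(links, iterations=20):
    # links: dict mapping page URL -> list of URLs it links to (assumed input).
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A page's authority grows with the hub scores of pages linking to it.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        # A page's hub score grows with the authority of the pages it links to.
        hub = {p: sum(auth[t] for t in links.get(p, ())) for p in pages}
        # Normalize so scores stay comparable between iterations.
        a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth
```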
sorry if this is coming across rudely. you’re creating some really cool stuff and I hope you keep doing that. I think there’s some existing work that you can learn from and be inspired by as well. That’s the crux of what I’m trying to say · 2 years ago
🤷 links are just a signal that can be used in different ways. extracting meaning from content can be challenging and some aspects of links can help.
i’m not saying to only follow in people’s footsteps. doing things your own way is something I believe in too. and sometimes trying to rediscover things from first principles is fun and can lead to interesting places.
my point is that this is a big field where hackers, academics, and non-CS people have all approached parts of it in new, interesting, and unique ways. don’t blindly copy them, but don’t dismiss their work out of hand completely either. · 2 years ago
@acidus There's also an implicit assumption in things like HITS - that the user desires content that is commonly viewed or quoted, vs. discovering new content that nobody has quoted or linked to. We could call that reverse-HITS, idk, lol. · 2 years ago
Yep you just independently discovered something that was researched and published in the early 90s: link text often is a better descriptor of a page than analyzing the content of the page or its metadata.
@acidus I also independently discovered this idea in like 5 minutes. I had this idea from probably like the first day I started the search engine. Discovering new ideas for search engines is not as hard as people make it out to be. The harder part is coming up with algorithms that are fast. · 2 years ago
@acidus What I mean is relevance by content is not the same as relevance by authority. The former looks only at content, while the latter looks at content and statistics (how many links link to this one page, etc.). · 2 years ago
@acidus I disagree that relevance of content is *solely* what all of the search engines are trying to do. HITS and PageRank are trying to do relevance by "authority", or make an assumption that relevance should be based on popularity or amount of links to something. I disagree that these would be effective though. They sound more like band-aid fixes.
Btw, I don't follow the "don't reinvent the wheel" mantra. If I discovered something that was already done, then it means that the thing that was already done matches with what I think is good. There's something to be said for going a completely different path from where others have gone. · 2 years ago
“One idea I had was storing the link titles and using those as like "alternative titles" for the pages they link to. That way you can use them for searching.”
Yep you just independently discovered something that was researched and published in the early 90s: link text often is a better descriptor of a page than analyzing the content of the page or its metadata.
i’m not trying to poke fun at you. but you don’t have to re-invent everything from scratch. I bet reading about IR and some of the early web papers will spark a lot of new ideas for you. · 2 years ago
“ relevance of the content“ is exactly what all these systems are trying to do 😂
“return accurate results for a search of a corpus”. this is called “Information retrieval” and it is an enormous field which started at the beginnings of computer science with big names like Claude Shannon contributing foundational work.
Content. metadata. hyperlinks. authority. accuracy vs precision. it’s all part of it. the IR textbook from Stanford is available for free as PDF and HTML. web search is only 2 chapters
I think you would really enjoy it. Check out the Wikipedia entry · 2 years ago
@acidus Just looked at HITS for a bit. I don't want to rate on popularity for my search engine, so I guess that's where mine distinguishes itself from others as well. I want mine to be purely based on relevance of content. I also find FTS to be too limiting in certain areas, unless there's like fancy fuzzy searching or whatever. The different approaches are interesting, and I think there's room for the use of all of them, because they all do things in different ways that may lend themselves to different uses.
One idea I had was storing the link titles and using those as like "alternative titles" for the pages they link to. That way you can use them for searching. · 2 years ago
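The "alternative titles" idea above can be sketched as a small crawl-time step: collect the link text other capsules use when pointing at a page, and index it alongside the page's own title. A rough sketch, with all names hypothetical:
```
from collections import defaultdict

alt_titles = defaultdict(set)  # target URL -> set of link texts seen for it

def record_link(target_url, link_text):
    # Called for every => link line encountered while crawling.
    text = link_text.strip()
    if text:
        alt_titles[target_url].add(text)

def searchable_titles(url, own_title):
    # The page's own title plus every phrase others used to describe it.
    return [own_title, *alt_titles[url]]
```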
@acidus One thing I liked that all the new search engines did after GUS/geminispace was actually storing the titles of pages. My search engine takes this even further by getting the metadata for audio files, as well as PDFs and DjVu files now too. I think it really improves search results by looking at metadata like this.
I am not aware of HITS, but I'll look into it, thanks. · 2 years ago
also, thank you for the kind words. it’s fun to build and I need to get back to hacking on it · 2 years ago
I love the diversity of search engines in Gemini so go for it. for example @haze/TLGS was (is?) using HITS, which is a cool approach
gemini://gemi.dev/cgi-bin/wp.cgi/view?HITS+algorithm
I had started to look at PageRank, but honestly the link density of Gemini space is low compared to the web, so it’s less helpful. i moved to a simple logarithmic popularity weight on top of the FTS results but honestly that makes Kennedy slower and i’m not sure the benefit is worth it. @haze should get bonus points for writing TLGS in C!🤯 · 2 years ago
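The logarithmic popularity weight described above could look something like the following sketch: take each result's FTS relevance score and nudge it by the log of its inbound-link count. The blend factor and field names are assumptions, not Kennedy's actual implementation:
```
import math

def combined_score(fts_score, inbound_links, weight=1.0):
    # log1p keeps pages with zero inbound links from being penalized to -inf.
    return fts_score + weight * math.log1p(inbound_links)

results = [
    {"url": "gemini://example.org/a", "fts_score": 3.2, "inbound_links": 40},
    {"url": "gemini://example.org/b", "fts_score": 3.5, "inbound_links": 1},
]
results.sort(key=lambda r: combined_score(r["fts_score"], r["inbound_links"]),
             reverse=True)
```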
@haze FYI, I use Firebird, and FTS was only *just* released last month as an extension. Anyways, I kinda despise being told I *have* to do things exactly like every other thing, because it's boring and isn't innovative :D
Kennedy, TLGS, and GUS already do FTS. We don't need Yet Another FTS Search Engine, lol. I want to do things differently. My search engine has always been experimental like that. · 2 years ago
@haze I don't store the full contents of pages - in fact, my search engine *never* did this. Previously, it did keyword extraction. However, I have removed the keyword extraction for the moment because it didn't produce the best results. I have, however, replaced it with tags (and soon, mentions), which tend to be short and therefore can have a (db) index.
I will look into FTS; however, I doubt I am going to use it because I don't like the idea of storing all of geminispace locally, at all. Instead, I would rather find better methods of indexing content - like better keyword extraction methods, and maybe a few other things. · 2 years ago
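The tags-instead-of-full-text approach above can be illustrated with a tiny schema: short tag values live in their own table with an ordinary index, so lookups never touch full page text. SQLite is used here purely for illustration (the engine discussed actually uses Firebird), and all table and column names are made up:
```
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE pages (url TEXT PRIMARY KEY, title TEXT);
    CREATE TABLE tags  (url TEXT, tag TEXT);
    CREATE INDEX idx_tags_tag ON tags(tag);  -- short values index cheaply
""")
con.execute("INSERT INTO pages VALUES (?, ?)", ("gemini://example.org/", "Example"))
con.execute("INSERT INTO tags VALUES (?, ?)", ("gemini://example.org/", "music"))
rows = con.execute(
    "SELECT p.url, p.title FROM tags t JOIN pages p ON p.url = t.url WHERE t.tag = ?",
    ("music",),
).fetchall()
```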
@krixano You need to use the Full Text Search capability provided by your RDBMS. GUS, Kennedy, TLGS all use it. It drops the time complexity of text matching from O(N) to basically O(log N). · 2 years ago
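For a concrete picture of the RDBMS full-text search being recommended here, a minimal sketch using SQLite's FTS5 module (the engines mentioned use their own databases; this only shows the idea of querying an inverted index instead of scanning every row):
```
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE VIRTUAL TABLE docs USING fts5(url, body)")
con.execute("INSERT INTO docs VALUES (?, ?)",
            ("gemini://example.org/log.gmi", "notes on search engines and gemini"))
# MATCH consults the inverted index; rank orders results by relevance.
hits = con.execute(
    "SELECT url, rank FROM docs WHERE docs MATCH ? ORDER BY rank", ("search",)
).fetchall()
```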