
👽 marginalia

Added a gemini ingress to my search engine for websites; results are ordered by how little javascript and markup they use. Like... a reverse SEO search engine: gemini://marginalia.nu/search?gemini -- I'll make it crawl gemini-space as well in the foreseeable future, but until then, enjoy exploring the more obscure corners of the big web.

2 years ago · 👍 kevinsan, lykso
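
A minimal sketch of what a "rank by least javascript and markup" signal could look like. Everything here (class name, patterns, weights) is an illustrative assumption, not Marginalia's actual scoring code:

```java
// Hypothetical "reverse SEO" signal: score a page higher the less
// javascript and markup it carries relative to its text content.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SimplicityScore {
    private static final Pattern SCRIPT = Pattern.compile("<script\\b", Pattern.CASE_INSENSITIVE);
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    static double score(String html) {
        int scripts = count(SCRIPT, html);
        String text = TAG.matcher(html).replaceAll("");
        double markupRatio = 1.0 - (double) text.length() / Math.max(1, html.length());
        // Assumed weighting: scripts are penalized more heavily than plain markup.
        return 1.0 / (1.0 + 5.0 * scripts + 10.0 * markupRatio);
    }

    private static int count(Pattern p, String s) {
        Matcher m = p.matcher(s);
        int n = 0;
        while (m.find()) n++;
        return n;
    }
}
```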

Links

[1] gemini://marginalia.nu/search?gemini

12 Replies

👽 kevinsan

Yes, I found that Lucene couldn't build partial indexes without ultimately requiring twice the index size in storage while it writes temporary files. Storage on a VPS is relatively expensive. · 2 years ago

👽 marginalia

@kevinsan The index software itself is custom-built, basically just memory-mapping files full of integers and doing magic. I did consider existing options like Elasticsearch and what have you, but they can't handle nearly this big of an index on this little hardware (and they also have way more features to make up for it). I'm offloading some metadata to MariaDB, but that's mainly used to help orchestrate the crawling. · 2 years ago
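
A minimal sketch of the memory-mapped-integers idea, assuming a file of sorted ints under 2 GB per mapping; the file layout and names are guesses for illustration, not the actual index format:

```java
// Map a file of sorted ints and binary-search it without loading it
// onto the heap; the OS page cache does the heavy lifting.
import java.io.IOException;
import java.nio.IntBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class MappedIndex {
    private final IntBuffer ints;

    MappedIndex(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // The mapping stays valid after the channel is closed.
            ints = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size()).asIntBuffer();
        }
    }

    // Classic binary search over the mapped buffer.
    boolean contains(int key) {
        int lo = 0, hi = ints.limit() - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int v = ints.get(mid);
            if (v < key) lo = mid + 1;
            else if (v > key) hi = mid - 1;
            else return true;
        }
        return false;
    }
}
```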

👽 kevinsan

what's your index/search software? · 2 years ago

👽 marginalia

@kevinsan I don't particularly mind the idea of bots as long as there's some attribution. The query server can take a pretty decent pounding. It's a bit slow now because the indexes aren't fully cached, but I've stress tested it against 100 rps and it didn't even seem to affect performance. The only thing is, I don't know how much uptime I can guarantee. Every time I regenerate the index it's going to have little or no content for half a week (I don't do that super often, only when introducing major code changes). I get what you mean about the search box being tricky: I might add a random button; I think I've found a way of constructing a query that doesn't absolutely murder the database. · 2 years ago
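
One cheap way to build such a random button, sketched here as a guess (the table and column names are hypothetical; the post doesn't say which query was actually found): probe a random point in the primary-key range instead of ORDER BY RAND(), which would scan and sort the whole table:

```java
// Random-page lookup that stays on the primary-key index, so it's
// O(log n) instead of a full-table scan. Schema is hypothetical.
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.concurrent.ThreadLocalRandom;

class RandomPage {
    static String pickUrl(Connection conn, long maxId) throws SQLException {
        long probe = ThreadLocalRandom.current().nextLong(maxId + 1);
        String sql = "SELECT url FROM urls WHERE id >= ? ORDER BY id LIMIT 1";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, probe);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getString(1) : null;
            }
        }
    }
}
```

The trade-off is a slight bias toward rows that follow gaps in the id sequence, which is usually acceptable for a "surprise me" button.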

👽 kevinsan

btw, the reason I ask is that a 'Search' box leaves me stumped on what I want to search for. I'm sure plenty of people get that feeling. Search is great when I have a specific question, but I want a system that offers me interesting ways to browse around. · 2 years ago

👽 kevinsan

15 million is great! Remember that Google's early index sizes were ~20M, and that was a systematic crawl with nothing discarded. What's your stance on bots being built on your search functions? · 2 years ago

👽 marginalia

@kevinsan The current index is only about 15 million. I've gotten it up to, I think, 800 million, but there comes a point around 100-200 million where growing it just increases the number of low-quality results that rank highly. So if I want to make it that far, I need to seriously hone my junk-detection algorithms. · 2 years ago

👽 kevinsan

@marginalia Yes, be ruthless. Even a billion URLs would probably be too many! How many URLs are in the index at this point in time? · 2 years ago

👽 marginalia

@lykso I might tweak the definition of what constitutes a word to allow integers, or even constructs like abc-NNN or abc(-abc)?(NNN)?; ideally I wanna capture cases like plan 9, day-Z, r2d2, etc. without matching against UUIDs and serial numbers. But I probably won't support arbitrary string searches anytime soon. I won't say never, but right now I'm this fast on consumer hardware because I can get away with backing the queries with a series of binary searches over integers. It may well be technically feasible, but that's going to be one big suffix tree (or something) I'd have to construct. · 2 years ago
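
To make the abc(-abc)?(NNN)? idea concrete, here's a rough sketch of such a token test; the exact grammar and length limits are assumptions (and multi-word cases like "plan 9" would be a tokenization concern, not handled here):

```java
// Assumed token shape: short alpha-leading core, optional single
// hyphenated part, optional short digit suffix. Long hex/serial
// strings like UUIDs fall outside it.
import java.util.regex.Pattern;

class TokenShape {
    private static final Pattern WORDISH =
        Pattern.compile("(?i)[a-z][a-z0-9]{0,7}(-[a-z0-9]{1,8})?([0-9]{1,4})?");

    static boolean isIndexable(String token) {
        return WORDISH.matcher(token).matches();
    }

    public static void main(String[] args) {
        // plan9, day-Z, r2d2 -> true; the UUID -> false
        for (String t : new String[]{"plan9", "day-Z", "r2d2",
                "550e8400-e29b-41d4-a716-446655440000"}) {
            System.out.println(t + " -> " + isIndexable(t));
        }
    }
}
```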

👽 marginalia

@kevinsan It's just taken, like... a couple of months of tweaking to get it this neat. I'm extremely judicious about what I allow into the index, since I realistically won't be able to grow it much beyond 1 billion URLs. Like, there are TLDs whose subdomains I won't touch, because excluding those removed 80% of the link farms. Sadly I won't ever crawl, for example, cr.yp.to as a result, but it really is all down to making concessions. I won't be able to index everything no matter what, so I try to make what I can get count the most. · 2 years ago
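
A sketch of that kind of crawl-time concession, assuming a simple per-TLD subdomain blocklist; the .to entry is inferred from the cr.yp.to example, everything else is illustrative:

```java
// Reject hosts with subdomains under blocklisted TLDs, while still
// allowing the bare domains themselves.
import java.util.Set;

class DomainFilter {
    // Hypothetical blocklist; .to inferred from the cr.yp.to example.
    private static final Set<String> NO_SUBDOMAIN_TLDS = Set.of("to");

    static boolean acceptable(String host) {
        String[] parts = host.split("\\.");
        if (parts.length <= 2) return true;   // bare domain, e.g. "yp.to"
        return !NO_SUBDOMAIN_TLDS.contains(parts[parts.length - 1]);
    }
    // acceptable("yp.to")    -> true
    // acceptable("cr.yp.to") -> false
}
```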

👽 lykso

That's lovely! I really like this approach.

Any chance you'll support phrases at some point? Tried searching "plan9" and "plan 9" without success. · 2 years ago

👽 kevinsan

This is awesome. Crawling's an edge-case nightmare, yet you've managed to keep your index clean. If you add a random feature (just 20 random index pages), you'd make serendipity even more likely! · 2 years ago