The poor performance of "no result" searches was caused by some misbehaviour when computing suggestions for alternate search terms, which eventually led to an exception.
I have disabled suggestions for empty search results for the moment. Suggestions will come back once I've sorted this out.
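For illustration only, here is a rough sketch of such a guard using whoosh's corrector API (not the actual geminispace.info code; field and path names are made up):

```
from whoosh import index
from whoosh.qparser import QueryParser

ix = index.open_dir("search-index")   # hypothetical index location
user_query = "some rare term"

with ix.searcher() as searcher:
    query = QueryParser("content", ix.schema).parse(user_query)
    results = searcher.search(query, limit=20)

    # only compute "did you mean" suggestions when there are hits;
    # the costly suggestion path on empty results is what got disabled
    suggestions = []
    if results:
        corrector = searcher.corrector("content")
        for word in user_query.split():
            suggestions.extend(corrector.suggest(word, limit=3))
```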
There's currently an issue with search queries that yield no results (i.e. geminispace.info can't find a page that matches the criteria):
These searches take a very long time until a "no results" page is returned, and sometimes they even fail with a "42 TEMPORARY FAILURE".
Any search for a known pattern returns its results within seconds, so if a search takes more than 20 seconds you can assume that geminispace.info does not know about a page matching your criteria.
We are looking into this.
Our crawl engine is now multi-threaded. This means that multiple requests are made in parallel and the overall crawl time is greatly reduced.
Additionally, the crawl order is now randomised, which should avoid requesting huge amounts of pages from a single capsule in a short time.
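As a rough illustration of the idea (not the actual crawler code; the fetch function is a placeholder):

```
import random
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # placeholder for the real gemini:// request
    return url

def crawl(urls, workers=8):
    urls = list(urls)
    # randomise the order so a single capsule isn't hit with a burst of requests
    random.shuffle(urls)
    # issue multiple requests in parallel
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, urls))
```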
It seems like we've finally solved our memory issue. In the end it may have been a small parameter for whoosh which ended up loading the whole index into RAM.
At first glance this didn't cause any performance drain; the system even seems more responsive now, perhaps because the high memory pressure was causing overhead before.
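The entry doesn't name the parameter, so purely as an illustration of the kind of knob involved: whoosh's writer, for example, takes a limitmb setting that caps how much RAM the indexing pool may use.

```
from whoosh import index

ix = index.open_dir("search-index")   # hypothetical index location
# keep the indexing pool at roughly 128 MB instead of letting memory grow unbounded
writer = ix.writer(limitmb=128)
# ... add/update documents ...
writer.commit()
```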
The server running geminispace.info has been updated to Debian 11.3 without any issues. :)
Some small tweaks to the indexing process, together with the removal of old, now defunct capsules that we were still trying to crawl, reduced the time needed for a complete update dramatically.
We had an outage due to a dependency upgrade that hit us late. `markupsafe`, which is not used by geminispace.info directly but is a dependency of `jinja`, shipped a breaking change in a minor release, which caused trouble for various people. We were just late to the party.
It's worked around for the moment; I'll have a proper look at it later.
geminispace.info now accepts more TLS cipher variants, which will hopefully allow us to crawl even more capsules.
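A sketch of what a more permissive client-side TLS setup can look like in Python (an assumption about the kind of change, not the actual code); Gemini uses trust-on-first-use, so certificate verification is handled separately:

```
import ssl

ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE        # TOFU checks happen elsewhere
ctx.minimum_version = ssl.TLSVersion.TLSv1_2
# lower the OpenSSL security level to accept older ciphers and smaller keys
ctx.set_ciphers("DEFAULT@SECLEVEL=1")
```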
geminispace.info is now monitored by shit.cx (and I will be alerted if something goes wrong). Big thanks to Jon for providing this service.
So the last refactor went... erm... upside down. We had an outage for a few hours because of this.
I rolled the changes back and will make another attempt at a (hopefully successful) refactor in the next days. *fingers crossed*
I've blocked two IPs for making the same stupid requests again and again:
2001:41d0:302:2200::180
193.70.85.11
One year ago today, geminispace.info was set up. You can probably guess what happened: the cert for the capsule expired today... :-D
A new cert is in place which now lasts for ten years...
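For reference, a self-signed cert with a ten-year lifetime can be generated along these lines with the Python cryptography package (just a sketch, not necessarily how the new cert was actually created):

```
import datetime
from cryptography import x509
from cryptography.x509.oid import NameOID
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import ec

key = ec.generate_private_key(ec.SECP256R1())
name = x509.Name([x509.NameAttribute(NameOID.COMMON_NAME, "geminispace.info")])
now = datetime.datetime.utcnow()

cert = (
    x509.CertificateBuilder()
    .subject_name(name)
    .issuer_name(name)                       # self-signed
    .public_key(key.public_key())
    .serial_number(x509.random_serial_number())
    .not_valid_before(now)
    .not_valid_after(now + datetime.timedelta(days=3650))   # roughly ten years
    .sign(key, hashes.SHA256())
)

with open("cert.pem", "wb") as f:
    f.write(cert.public_bytes(serialization.Encoding.PEM))
with open("key.pem", "wb") as f:
    f.write(key.private_bytes(
        serialization.Encoding.PEM,
        serialization.PrivateFormat.PKCS8,
        serialization.NoEncryption(),
    ))
```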
geminispace.info is performing pretty well at the moment, it's reasonably fast and very reliable.
There was no need for me to hack around on anything, although a few optimizations are still open. I'm going to tackle these todos in the next year.
The "newest-hosts" page now shows the 30 newest host instead of only 10.
I'm currently quite happy with the reliability and performance of the crawl and indexing processes.
So I removed some older excludes; you should expect to see a whole lot more indexed pages after the next crawl.
We'll have to see if I regret this change... ;)
geminispace.info is now powered by Debian 11 Bullseye :)
I just pushed a small fix that allows searching for backlinks without giving the otherwise mandatory scheme. The scheme is now added automatically.
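The fix boils down to something like this (hypothetical helper, not the actual code):

```
def normalize_backlink_query(url: str) -> str:
    url = url.strip()
    # prepend the scheme when the user left it out
    if "://" not in url:
        url = "gemini://" + url
    return url

assert normalize_backlink_query("example.org/page.gmi") == "gemini://example.org/page.gmi"
```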
I pushed a small change to production to ensure that URIs added via seed requests include the scheme. The scheme was mandatory before as well, but due to a recent change we no longer crawl schemeless URIs at all, as per the spec.
If you added your capsule in the last days without a scheme, this is now fixed and the capsule should be included in the index.
Thanks to a contribution by Hannu Hartikainen, geminispace.info is again able to honor the user-agents "gus", "indexer" and "*" in robots.txt.
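Roughly, the check now has to consult several user-agent sections. A sketch with the standard library parser (the actual contributed implementation may differ), assuming the robots.txt content has already been fetched over gemini://:

```
from urllib.robotparser import RobotFileParser

def allowed(robots_txt: str, url: str) -> bool:
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    # a URL is only crawled if none of the honoured agents disallows it;
    # "*" is the parser's fallback section anyway
    return all(rp.can_fetch(agent, url) for agent in ("gus", "indexer", "*"))
```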
The revamped data store seems to work fine so far.
Unfortunately I had to disable the "newest hosts" and "newest pages" pages, as the data is currently not available. I'll add them back later, but before that I'd like to have the cleanup mechanism implemented to get rid of old data from capsules that are no longer available.
I finally managed to analyze the index process. In the end it turned out to be an issue when calculating the backlink counters, and with an adapted query indexing is fast again.
Obviously I was horribly wrong all the time in blaming the slow VPS.
Unfortunately this is only a small step in the major overhaul of GUS.
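The entry doesn't show the adapted query, but a typical fix of this kind is to compute all backlink counters in one grouped query instead of issuing one COUNT per page. A purely illustrative peewee sketch with made-up models:

```
from peewee import SqliteDatabase, Model, CharField, fn

db = SqliteDatabase("gus.sqlite")      # hypothetical database file

class Link(Model):
    from_url = CharField()
    to_url = CharField()

    class Meta:
        database = db

db.create_tables([Link], safe=True)

# one grouped query for all backlink counters instead of one COUNT per page
backlink_counts = {
    row.to_url: row.count
    for row in Link.select(Link.to_url, fn.COUNT(Link.id).alias("count"))
                   .group_by(Link.to_url)
}
```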
More trouble along the way. Although the VPS hosting geminispace.info runs with 8 Gigs of RAM and does not serve other services, the index update got oom-killed. :(
It seems that due to the continued growth of Gemini we are hitting the same problems Natalie hit a few months ago with GUS. I'm currently unsure about the next steps.
The last reindex took almost ten days to complete, as I had triggered a full reindex. This was necessary after the cleanup, since there is currently no incremental cleanup of the search index implemented.
The design of GUS - which was clearly never meant to index such a huge number of capsules - and the slow VPS are currently not helping to keep the index up to date. Unfortunately we are stuck with the VPS for now.
Currently there is no progress to report on the coding side. I'm busy with various other things, and late in the evening I can't be bothered to tackle some of the obvious tasks to improve GUS. If you are interested in helping to improve GUS/geminispace.info, feel free to comment on one of the issues or drop me a mail.
issues and todos of geminispace.org
I've done some manual cleanup of the base data over the last days. This decreased the raw data size from over 3 GB to roughly 2 GB. Unfortunately a new mirror of godocs came online... another thing we need to exclude for the moment.
geminispace.info has been running rather stably over the last weeks, but I added it to my external status monitor anyway:
external status monitor (web only currently :( )
It will alert me if it goes down.
No news on the coding side currently. Other projects are occupying the time I can devote to tech stuff.
issues and todos of geminispace.org
We'll have a few days off; I'll get back to some coding after that.
geminispace.info is now aware of more than 1000 capsules. Unfortunately this number is somewhat misleading: some of the capsules may already be gone, but GUS lacks a mechanism for invalidating old data.
I'll probably start with some manual cleanup in the next days, so don't worry if the numbers go down.
We are back on track with crawl and index, everything is up-to-date again.
I had to add another news mirror and a Wikipedia mirror to the exclude list. The current implementation can't handle such huge amounts of information well.
Obviously this didn't work as expected. For whatever reason, indexing fails repeatedly on one page or another with a mysterious sqlite error. It may take a few days until I find enough time to search for the cause of this error.
If you are familiar with peewee and sqlite or have come across this issue earlier, let me know:
Here's the issue related to this error on src.clttr.info
The index is currently a few days behind. It will hopefully catch up during the day.
From now on I will exclude any sort of news or YouTube mirrors from the crawl without further notice.
For the sake of transparency I may add a section that mentions what is excluded and why. But this is not a high priority for me.
There are currently some issues during the crawl that sometimes lead to an interruption, so it may take more than the usual 3 days until new content is discovered.
This will eventually be solved once the migration to PostgreSQL is done; unfortunately I'm quite busy with real life at the moment, so it may take some time.
I started working on migrating the backing database to PostgreSQL instead of SQLite.
This may take a while, but it will eventually solve some of the problems that currently occur around crawling and indexing.
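With peewee this is largely a matter of swapping the database backend behind the models; a sketch with made-up connection details:

```
from peewee import PostgresqlDatabase

# previously something like: db = SqliteDatabase("gus.sqlite")
db = PostgresqlDatabase(
    "gus",                 # database name (assumption)
    user="gus",
    password="secret",
    host="localhost",
    port=5432,
)
```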
Not sure if I can keep the update schedule of every 3 days.
The current crawl has been running for more than 24 hours now and is still not finished.
The shady workaround is now in place - index updates won't block searches anymore.
This is even more important with the ongoing growth of geminispace - as of today there are more than 750 capsules we know about.
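One common shape for such a workaround (this is a guess at the approach, not the actual code) is to build the new whoosh index in a fresh directory and atomically repoint a symlink, so searches keep reading the old index until the swap:

```
import os
import tempfile

def swap_index(live_link="index"):
    # assumes "index" is (or becomes) a symlink to the currently live index directory
    new_dir = tempfile.mkdtemp(prefix="index-", dir=".")
    # ... build the new whoosh index inside new_dir here ...
    tmp_link = live_link + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(new_dir, tmp_link)
    os.replace(tmp_link, live_link)   # atomic rename on POSIX
    return new_dir
```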
I'm currently working on a workaround to avoid the index update blocking search requests.
Unfortunately I broke the index while doing this... I need to be more careful when doing maintenance.
I've made some adjustments on how GUS/geminispace.info uses robots.txt.
Previously we tried to honor the settings for the *, indexer and gus user-agents. That didn't work out well with the available Python libraries for robots.txt parsing, and GUS ended up crawling files it wasn't meant to.
We now only use the settings for * and indexer, no special handling for GUS anymore. All indexers unite. ;)
The first fully unattended index update happened last night.
There are still some rough edges to clean up, but we are on the way to having up-to-date search results without manual intervention.
geminispace.info has just been announced on the gemini mailing list.
geminispace.info is going public! Yeah! :)
test drive of instance search.clttr.info started