Posted on 2024-08-19
I'm a pretty bad server operator, all things considered. I never really take the time to know my services deeply and typically deploy things by following the most basic instructions online. It's probably not a good way to do things, but it is what it is.
I run a public Funkwhale server. I have done for years. I probably started it back in 2018-2019 and have kept it going ever since. I originally started it as a way to self-host my own music collection and to learn Docker, so it's basically been trucking along for the last half decade in Docker containers without too many issues.
A while ago, though, I noticed a really strange issue. I couldn't upgrade the API container. When I would try, I would see strange issues in the logs about collation and unique constraints. I troubleshot this issue with Georg and we discovered that - somehow - the index of one of the user tables had corrupted and a unique key violation had occurred. This was very strange. We looked into it more and more but could find no cause. Oh well. I removed the offending entry and reindexed the table. The upgrade worked and I thought nothing more of it.
Recently, the server has really been falling apart. Sessions have been logging out, the server would randomly stop serving requests, that sort of thing. I had a few users reach out to manually resolve issues with not being able to log in due to their email change not being confirmed when they clicked on the email link. All very strange, but I was just too burned out to really want to dig into it. I just assumed something in the upstream code was broken and there wasn't much I'd be able to do to help it since I really don't understand the backend code.
But then I saw that same telltale thing when I tried to update. A unique key constraint violation. I opened up the postgres container and tried to run a reindex. The tags_tag table had 278 duplicated tags with thousands of attached tagged items. Horror.
The other thing I noticed when I was trying to write a query to resolve this issue was the following warning:
database "postgres" has no actual collation version, but a version was recorded
Now, I don't understand Postgres at all, really, but I do remember Georg commenting on this before. I did a bit of searching and came across the following command to fix it:
ALTER DATABASE postgres REFRESH COLLATION VERSION;
When I ran this, however, all I got back was:
invalid collation version change
Bugger.
I thought I'd do some digging. What it comes down to is that - due to my inexperience - I was using a system to automate the upgrade of Postgres which mixed libc types (I run the container on Alpine, but the upgrader runs on Ubuntu). This was causing a lot of weirdness. I decided to take the plunge and do a full backup and restore of the database to allow Postgres to sort itself out and luckily it did just that. For the first time in goodness knows how long, there are no postgres errors in my logs.
Long story short: take the time to learn about what you're running and don't take too many shortcuts. They come back to bite you hard.