2024-10-27 Upgrading GoToSocial from 0.16.0 to 0.17.1

Here's something to watch out for, if you're like me: Disable all the infrastructure that watches over your processes. In my case, the problem was Monit. It checks the website every five minutes, and if it fails to connect three times in a row, it restarts the server, breaking the migration. 😭

systemctl stop gotosocial
# prevent systemd from starting it again at boot
systemctl disable gotosocial
# prevent monit from interrupting the migration with a restart!
monit unmonitor gotosocial
# backup!
mkdir backup
cp sqlite.db backup/

Now you're ready to extract the new version over the old one, compare your config file with the example provided, and start it again.

systemctl enable gotosocial
systemctl start gotosocial
journalctl --unit gotosocial --follow

Don't be like me and start Monit again right away: my Monit config checks the URL every five minutes and restarts GoToSocial if the site is not up. That is a big problem if the migration takes more than a handful of minutes.

I ended up with a borked migration restart loop, so I stopped it all again, overwrote the borked database file with the backup, and redid the upgrade.
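
One way to avoid switching Monit back on too early is to wait until the site actually answers. This is a minimal sketch, not something I ran during the upgrade: `wait_until` is a made-up helper, the URL is a placeholder, and it assumes the Monit service entry is called `gotosocial` as in the commands above.

```shell
# wait_until MAX_TRIES DELAY_SECONDS COMMAND...
# Runs COMMAND repeatedly until it succeeds; gives up after MAX_TRIES attempts.
wait_until() {
    max_tries=$1
    delay=$2
    shift 2
    tries=0
    while [ "$tries" -lt "$max_tries" ]; do
        if "$@"; then
            return 0
        fi
        tries=$((tries + 1))
        sleep "$delay"
    done
    return 1
}

# Example (placeholder URL): poll once a minute for up to twelve hours and
# only hand the service back to Monit once the site responds.
# wait_until 720 60 curl --silent --fail --output /dev/null https://social.example.org/ \
#     && monit monitor gotosocial
```

The check command is passed in as arguments, so the same loop works with any other test of "the migration is done".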

#Administration #GoToSocial

**2024-10-28**. Another thing to note for the GoToSocial upgrade is that I ran 0.16.0 using a systemd MemoryMax of 200M; today the upgraded instance with 0.17.1 ran fine for a while and then locked up. A restart didn't bring it back. It remained stuck after a log message saying "compiling WebAssembly". I increased MemoryMax to 300M, no change. I increased it to 500M and the instance came up. Just in case you're as memory-stingy as I am...

In order to avoid future compilation, @dumpsterqueer@superseriousbusiness.org pointed me at this:

You can instruct GoToSocial on where to store the Wazero artifacts by setting the environment variable `GTS_WAZERO_COMPILATION_CACHE` to a directory, which will be used by GtS to store two smallish artifacts of ~50MiB or so each (~100MiB total). – Configuration Overview

I'll try that.
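
For a systemd service, the variable can be set in a drop-in file. A minimal sketch, assuming the unit is called `gotosocial`; the helper name, the drop-in file name and both example paths are my own assumptions, and the cache directory has to be writable by the user the service runs as.

```shell
# write_cache_dropin DROPIN_DIR CACHE_DIR
# Writes a systemd drop-in that sets GTS_WAZERO_COMPILATION_CACHE for the service.
write_cache_dropin() {
    dropin_dir=$1
    cache_dir=$2
    mkdir -p "$dropin_dir" &&
        printf '[Service]\nEnvironment=GTS_WAZERO_COMPILATION_CACHE=%s\n' "$cache_dir" \
            > "$dropin_dir/wazero-cache.conf"
}

# Example (both paths are assumptions, adjust to your setup):
# write_cache_dropin /etc/systemd/system/gotosocial.service.d /home/gotosocial/wazero-cache
# systemctl daemon-reload && systemctl restart gotosocial
```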

It looks like a side-effect of GoToSocial implementing the direct messages API is that the Toot! app I'm using is showing me all my former direct messages using its special user interface (those bubbles on the right-hand side). I have to open every single one of them to dismiss it. 🤨

**2024-10-29**. Today I read that the botsin.space instance was shutting down. I figured I might start thinking about creating a second account for my blog on my own instance. I tried to run `./gotosocial admin account create` a few times, forgetting this or that parameter. And then I noticed that the replies I saw scrolling by always ended in an error message. In fact, there were more such error messages in my log files: "database disk image is malformed" 😱

The `.recover` command didn't work when I tried it:

# sqlite3 sqlite.db ".recover" | sqlite3 new.db
sql error: SQL logic error (1)

So then I tried the following:

monit unmonitor gotosocial
systemctl stop gotosocial
sqlite3 sqlite.db ".dump" > db.sql
mkdir backup
mv sqlite.db backup/
sudo -u gotosocial sqlite3 sqlite.db < db.sql
gzip backup/sqlite.db
gzip db.sql

Some errors that I saw:

A few lines about accounts with no account_uri even though that was a NOT NULL column.

Many, many such lines:

no such table: sqlite_stat4

Then this one:

NOT NULL constraint failed: conversations.thread_id (19)

I started to feel bad about the whole thing.

I aborted the operation. The gzip command hadn't finished yet. I restored the old database file.

mv backup/sqlite.db .
systemctl start gotosocial

As it turns out, now my GoToSocial instance seems to be unreachable. The service starts, `htop` shows processes churning. The log shows i/o timeouts and "No Content: wrote 0B" log messages scrolling by. Oof! 😓

Looking at the timestamps again, it seems that the recovery command left a `sqlite.db-shm` and a `sqlite.db-wal` file in place.

-rw-r--r--     1 gotosocial gotosocial 10445488128 29. Okt 22:47 sqlite.db
-rw-r--r--     1 gotosocial gotosocial       32768 29. Okt 23:19 sqlite.db-shm
-rw-r--r--     1 gotosocial gotosocial      341992 29. Okt 23:19 sqlite.db-wal

That can't be right. So I'm going to stop `gotosocial`, move these two files away, and start it again.

Sadly, no luck.

Perhaps there is a database recovery going on? I can't tell. This time around I see the typical startup messages, something about "recovered queued tasks", about 12 requests that look like regular requests, and then nothing.

I'll let it run for a bit.

I restarted it again. It seems to work?

**2024-10-30**. The database is still corrupt in some way. There are a lot of errors. Here are two examples:

error dereferencing remote status … : enrichStatus: failed to dereference status author … : enrichAccount: error putting in database: sqlite3: database disk image is malformed (code=11 extended=11)

0xc0091c61e0: error processing: CreateAnnounce: error dereferencing announce: EnrichAnnounce: error fetching boost target … : enrichStatus: failed to dereference status author … : enrichAccount: error putting in database: sqlite3: database disk image is malformed (code=11 extended=11)

There's something about these authors that's not working.

The code in `account.go`:

// This is new, put it in the database.
err := d.state.DB.PutAccount(ctx, latestAcc)
if err != nil {
	return nil, nil, gtserror.Newf("error putting in database: %w", err)
}

I feel that this is where things are going wrong. Something about the accounts table.

I'm going to make an offline copy of the `sqlite.db` file. Sadly, `.recover` doesn't work on my laptop either.

$ sqlite3 sqlite.db ".recover" > data.sql
sql error: SQL logic error (1)

Not looking good! I'm going to try the dump.

sqlite3 sqlite.db ".dump" > data.sql
sqlite3 recovery.db < data.sql 2>&1 |tee recovery.log

Let's look at the log file and list the errors!

+-------------+---------------+--------------------------------+
| Occurrences |     Type      |             Error              |
+-------------+---------------+--------------------------------+
|         454 | Runtime error | UNIQUE constraint failed:      |
|             |               | media_attachments.id           |
|          69 | Runtime error | NOT NULL constraint failed:    |
|             |               | accounts.uri                   |
|        2111 | Parse error   | no such table: sqlite_stat4    |
|           1 | Runtime error | NOT NULL constraint failed:    |
|             |               | conversations.thread_id        |
+-------------+---------------+--------------------------------+
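
Counts like these can be pulled out of `recovery.log` with standard tools. A minimal sketch, assuming every error line in the log looks like `Parse error near line 123: no such table: sqlite_stat4`; the function name is made up and the output is a plain tab-separated list rather than a pretty table.

```shell
# summarize_errors LOGFILE
# Strips the "near line N" part so identical errors collapse into one entry,
# then prints "count<TAB>message", most frequent first.
summarize_errors() {
    sed -E 's/ near line [0-9]+//' "$1" |
        awk '{ count[$0]++ } END { for (m in count) print count[m] "\t" m }' |
        sort -rn
}

# Example:
# summarize_errors recovery.log
```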

I ended up filing an issue.

And then, later that day, I used `.dump`. This time around, there was a `COMMIT` at the end of the dump, so no change was required.

sqlite3 sqlite.db ".dump" > data.sql
tail data.sql  # verify that there is a COMMIT at the end
sqlite3 recovery.db < data.sql 2>&1 |tee recovery.log
rsync --archive --itemize-changes recovery.db "sibirocobombus.root:/home/gotosocial/sqlite.db"

The recovery log showed all the errors mentioned above, and I used the new database anyway.
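
The `tail` check for the `COMMIT` can be turned into a small guard, so that a dump that doesn't end in `COMMIT;` never gets loaded by accident. A minimal sketch with a made-up function name; as far as I know, the `sqlite3` shell ends a dump that ran into errors with a `ROLLBACK` line instead of `COMMIT;`, which is what this guards against.

```shell
# dump_is_complete DUMPFILE
# Succeeds only if the last non-empty line of the dump is exactly "COMMIT;".
dump_is_complete() {
    last_line=$(tail -n 5 "$1" | grep -v '^[[:space:]]*$' | tail -n 1)
    [ "$last_line" = "COMMIT;" ]
}

# Example:
# dump_is_complete data.sql && sqlite3 recovery.db < data.sql 2>&1 | tee recovery.log
```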

**2024-10-31**. Currently the instance is locking up every few minutes, as far as I can tell. 😰

**2024-11-06**. The instance has been stable these days!

**2025-03-03**. Tried to migrate to v0.18.0 in February but failed; see #3788 for more. Now I'm trying again. Key differences:

no more memory limitations on the service;
don't let Monit watch the service during the upgrade, because the migration takes so long that Monit's restarts would interrupt it.

To illustrate: I just saw in my log that `alter table statuses drop column visibility` took 31 min.

This time it took from 8:30 until 15:40 for the migration. More than seven hours!

Everything is slow. Unbearably slow. Semaphore claims the Internet is down and shows me cached posts. `toot tui` gives me an exception. But perhaps, slowly, things improve. I hope that this is the 7h backlog that needs to go through.

Perhaps the problem is the age of my SQLite?

alex@sibirocobombus ~> sqlite3 -version
3.40.1 2022-12-28 14:03:47 df5c253c0b3dd24916e4ec7cf77d3db5294cc9fd45ae7b9c5e82ad8197f3alt1

@dumpsterqueer@superseriousbusiness.org suggested running a manual ANALYZE and so I did:

root@sibirocobombus /h/gotosocial# sudo -u gotosocial sqlite3 sqlite.db 
SQLite version 3.40.1 2022-12-28 14:03:47
Enter ".help" for usage hints.
sqlite> PRAGMA analysis_limit=0; ANALYZE;
0
sqlite> .quit

Surprisingly, this ran in less than five minutes! The ANALYZE that was part of migration ran for 1h and 50 min.

At the moment I don't see much difference, but who knows.

**2025-03-04**. The maintainers were super nice. We discussed various options in #3872, investigated upstream, and finally found that my virtual machine didn't support a particular feature that the WASM compiler required, so the WASM code was being interpreted instead and was therefore very slow. To confirm this, check your WASM cache (which you have to enable using an environment variable in your systemd unit definition). In my case the directories were being created but no files were being saved. The compilation failed silently. Or if there was something in the logs, I missed it.
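
The cache check can be scripted. A minimal sketch, assuming the cache directory is whatever `GTS_WAZERO_COMPILATION_CACHE` points at; the function name and the example path are made up.

```shell
# cache_has_files DIR
# Succeeds if at least one regular file exists anywhere below DIR.
# Empty subdirectories alone do not count.
cache_has_files() {
    [ -n "$(find "$1" -type f 2>/dev/null | head -n 1)" ]
}

# Example (path is an assumption):
# cache_has_files /home/gotosocial/wazero-cache || echo "cache is empty"
```

If the directories exist but this check fails after the service has been running for a while, you may be seeing the same thing I did.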

The first fix was to use the "nowasm" build offered by GoToSocial. It worked! It's a good solution because I have `ffmpeg` and `sqlite3` libraries installed.

One avenue the maintainers wanted to explore was a CPU feature that wasn't optional: SSE4.1. To check, look at `/proc/cpuinfo`. Does it expose `sse4_1`? Mine did not.
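
The check is a one-liner, but wrapped in a function it can go into a pre-upgrade script. A minimal sketch with a made-up function name; the optional second argument only exists so that the check can be tried against a saved copy of the file.

```shell
# has_cpu_flag FLAG [CPUINFO_FILE]
# Succeeds if FLAG appears as a whole word in the first "flags" line.
has_cpu_flag() {
    flag=$1
    cpuinfo=${2:-/proc/cpuinfo}
    grep -m 1 '^flags' "$cpuinfo" | grep -qw "$flag"
}

# Example:
# has_cpu_flag sse4_1 || echo "no SSE4.1: ask the hosting provider"
```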

When I contacted my hosting provider's support, I got the help I needed. It was a setting I had to switch on, followed by a power cycle of the virtual machine. The support person even hopped onto the issue tracker to tell others about it. And with that, the problem was fixed! `/proc/cpuinfo` shows the `sse4_1` flag and I'm back on the regular GoToSocial build. 🥳