2020-08-29 Restarting Phoebe

I have trouble restarting Phoebe and other services I run. I guess I don’t understand how these forking processes work. They fork for every request they get, and that works. When they get a lot of requests, however, they enter some sort of failed state. And by “a lot” I mean more than a hundred or so. I know that’s not a lot, but I want these services to work for the small net, so I don’t want to spend any effort on making them handle more. I don’t want to start caching HTTP requests, for example. What irks me most of all is that this doesn’t happen because hundreds of visitors want to know about my stuff. No, I share a link on Mastodon, my post gets federated, and then every single server on the fediverse tries to get a preview image to display.

2020-08-16 Mastodon kills Gemini Wiki

I’ve solved this issue for all my services behind Apache by blocking all the fediverse user agents, but Phoebe runs without a web server front-end. I’ve tried to abort as soon as possible, using the same regular expression, but that doesn’t seem to work.

But I have a second layer of defence: Monit watches over my processes. Please forgive the *huge* start program option below; just scroll past it. Maybe it’s time to move some of that into a config file. I didn’t want to shorten it, though, because I think it’s an interesting snapshot of a non-trivial Phoebe setup.

Let’s go through this.

First we have a PID file, which is where the process ID gets written. This is how Monit identifies the parent process responsible for the service. The --pid_file option is what tells Phoebe to write that same file. So far so good.

check process phoebe with pidfile /home/alex/farm/phoebe.pid
    start program = "/usr/bin/perl -I/home/alex/phoebe/lib /home/alex/farm/phoebe
 --setsid --user=alex --group=alex
 --log_level=3 --log_file=/home/alex/farm/phoebe.log
 --pid_file=/home/alex/farm/phoebe.pid
 --wiki_dir=/home/alex/phoebe
 --host=transjovian.org --cert_file=/var/lib/dehydrated/certs/transjovian.org/fullchain.pem --key_file=/var/lib/dehydrated/certs/transjovian.org/privkey.pem
 --host=toki.transjovian.org --cert_file=/var/lib/dehydrated/certs/transjovian.org/fullchain.pem --key_file=/var/lib/dehydrated/certs/transjovian.org/privkey.pem
 --host=vault.transjovian.org --cert_file=/var/lib/dehydrated/certs/transjovian.org/fullchain.pem --key_file=/var/lib/dehydrated/certs/transjovian.org/privkey.pem
 --host=communitywiki.org --cert_file=/var/lib/dehydrated/certs/communitywiki.org/fullchain.pem --key_file=/var/lib/dehydrated/certs/communitywiki.org/privkey.pem
 --host=alexschroeder.ch --cert_file=/var/lib/dehydrated/certs/alexschroeder.ch/fullchain.pem --key_file=/var/lib/dehydrated/certs/alexschroeder.ch/privkey.pem
 --host=next.oddmuse.org --cert_file=/var/lib/dehydrated/certs/oddmuse.org/fullchain.pem --key_file=/var/lib/dehydrated/certs/oddmuse.org/privkey.pem
 --wiki_main_page=Welcome --wiki_pages=About
 --wiki_mime_type=image/png --wiki_mime_type=image/jpeg
 --wiki_mime_type=audio/mpeg
 --wiki_space=transjovian.org/test
 --wiki_space=transjovian.org/phoebe
 --wiki_space=transjovian.org/gemini"
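
To make the PID file idea concrete, here’s a minimal shell sketch of the kind of check a supervisor can do with it: read the number from the file and ask the kernel whether such a process exists. This is an illustration, not Monit’s actual implementation, and the function name pidfile_alive is made up for this example.

```shell
# Report whether the PID stored in a PID file belongs to a live process.
# The function name is hypothetical; the path is passed as an argument.
pidfile_alive() (
    pidfile=$1
    # No PID file (or an empty one): nothing to check.
    [ -s "$pidfile" ] || exit 1
    pid=$(cat "$pidfile")
    # Refuse anything that is not a plain number.
    case $pid in
        ''|*[!0-9]*) exit 1 ;;
    esac
    # Signal 0 delivers nothing; it only tests that the PID exists
    # and that we are allowed to signal it.
    kill -0 "$pid" 2>/dev/null
)

# Example:
#   pidfile_alive /home/alex/farm/phoebe.pid && echo "parent is running"
```

A check like this can only be as good as the file: if the file disappears while a process lives on, it reports that nothing is running.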

OK, with that out of the way, let’s talk about the important stuff: stopping and restarting the process, and determining when to restart the process.

    # leave enough time after a stop for the server to recover before starting
    stop program = "/bin/bash -c 'kill -s SIGKILL `cat /home/alex/farm/phoebe.pid`; sleep 120'"
    if failed
        host transjovian.org
        port 1965
        type tcpssl
        send "gemini://transjovian.org:1965/\r\n"
        expect "20 .*"
        for 5 cycles
        then restart
    if totalmem > 100 MB for 5 cycles then restart
    if 6 restarts within 15 cycles then stop

Monit checks the service using a regular request, once every cycle (5 min). If the check fails five times in a row (25 min), Monit restarts the service. It also restarts the service when total memory is above 100 MB five times in a row. And if it has had to restart six times within 15 cycles (75 min), it stops the process.
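
The same probe can be run by hand, which is handy for seeing what Monit would see. Here’s a rough sketch, assuming openssl and timeout (from coreutils) are installed. The function names are made up for this example, and the check is slightly stricter than the Monit rule: it wants the header to start with “20 ”.

```shell
# Succeed if a Gemini response header reports success,
# i.e. if it starts with "20 ".
gemini_header_ok() {
    case $1 in
        "20 "*) return 0 ;;
        *)      return 1 ;;
    esac
}

# Send one request and print the first line of the response (the header).
# Gives up after 10 seconds so that a hung server cannot block the caller.
gemini_header() (
    host=$1
    port=${2:-1965}
    printf 'gemini://%s:%s/\r\n' "$host" "$port" |
        timeout 10 openssl s_client -quiet -connect "$host:$port" \
            -servername "$host" 2>/dev/null |
        head -n 1
)

# Example:
#   gemini_header_ok "$(gemini_header transjovian.org)" && echo "up"
```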

What happens on a restart? First the program is stopped and then it is started. But look at what happened last night, for example:

[CEST Aug 29 00:39:05] error    : 'phoebe' total mem amount of 190.0 MB matches resource limit [total mem amount>100 MB]
[CEST Aug 29 00:39:05] info     : 'phoebe' trying to restart
[CEST Aug 29 00:39:05] info     : 'phoebe' stop: '/bin/bash -c kill -s SIGKILL `cat /home/alex/farm/phoebe.pid`; sleep 120'
[CEST Aug 29 00:39:35] info     : 'phoebe' start: '/usr/bin/perl -I/home/alex/phoebe/lib /home/alex/farm/phoebe --setsid --user=alex --group=alex --log_level=3 --log_file=/home/alex/farm/phoebe.log --pid_file=/home/alex/farm/phoebe.pid --wiki_dir=/home/alex/phoebe --host=transjovian.org --cert_file=/va...'
[CEST Aug 29 00:44:42] error    : 'phoebe' process is not running
[CEST Aug 29 00:44:42] info     : 'phoebe' trying to restart
[CEST Aug 29 00:44:42] info     : 'phoebe' start: '/usr/bin/perl -I/home/alex/phoebe/lib /home/alex/farm/phoebe --setsid --user=alex --group=alex --log_level=3 --log_file=/home/alex/farm/phoebe.log --pid_file=/home/alex/farm/phoebe.pid --wiki_dir=/home/alex/phoebe --host=transjovian.org --cert_file=/va...'
[CEST Aug 29 00:49:57] error    : 'phoebe' process is not running
[CEST Aug 29 00:49:57] info     : 'phoebe' trying to restart
[CEST Aug 29 00:49:57] info     : 'phoebe' start: '/usr/bin/perl -I/home/alex/phoebe/lib /home/alex/farm/phoebe --setsid --user=alex --group=alex --log_level=3 --log_file=/home/alex/farm/phoebe.log --pid_file=/home/alex/farm/phoebe.pid --wiki_dir=/home/alex/phoebe --host=transjovian.org --cert_file=/va...'
[CEST Aug 29 00:55:02] error    : 'phoebe' process is not running
[CEST Aug 29 00:55:02] info     : 'phoebe' trying to restart
[CEST Aug 29 00:55:02] info     : 'phoebe' start: '/usr/bin/perl -I/home/alex/phoebe/lib /home/alex/farm/phoebe --setsid --user=alex --group=alex --log_level=3 --log_file=/home/alex/farm/phoebe.log --pid_file=/home/alex/farm/phoebe.pid --wiki_dir=/home/alex/phoebe --host=transjovian.org --cert_file=/va...'
[CEST Aug 29 01:00:12] error    : 'phoebe' process is not running
[CEST Aug 29 01:00:12] info     : 'phoebe' trying to restart
[CEST Aug 29 01:00:12] info     : 'phoebe' start: '/usr/bin/perl -I/home/alex/phoebe/lib /home/alex/farm/phoebe --setsid --user=alex --group=alex --log_level=3 --log_file=/home/alex/farm/phoebe.log --pid_file=/home/alex/farm/phoebe.pid --wiki_dir=/home/alex/phoebe --host=transjovian.org --cert_file=/va...'
[CEST Aug 29 01:05:19] error    : 'phoebe' process is not running
[CEST Aug 29 01:05:19] info     : 'phoebe' trying to restart
[CEST Aug 29 01:05:19] info     : 'phoebe' start: '/usr/bin/perl -I/home/alex/phoebe/lib /home/alex/farm/phoebe --setsid --user=alex --group=alex --log_level=3 --log_file=/home/alex/farm/phoebe.log --pid_file=/home/alex/farm/phoebe.pid --wiki_dir=/home/alex/phoebe --host=transjovian.org --cert_file=/va...'
[CEST Aug 29 01:10:25] error    : 'phoebe' service restarted 6 times within 6 cycles(s) - stop
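
One detail in these timestamps is worth a second look: the start command runs 30 seconds after the stop command, even though the stop command ends with sleep 120. That would be consistent with Monit cutting the stop program short (its documented default timeout for start and stop programs is 30 seconds), but that’s an inference, not something the log says. The arithmetic, as a small sketch with made-up helper names:

```shell
# Convert an HH:MM:SS timestamp to seconds since midnight.
# One leading zero is stripped per field so "08" is not read as octal.
to_seconds() (
    IFS=:
    set -- $1
    h=${1#0}; m=${2#0}; s=${3#0}
    echo $(( h * 3600 + m * 60 + s ))
)

# Seconds between two timestamps on the same day.
elapsed() {
    echo $(( $(to_seconds "$2") - $(to_seconds "$1") ))
}

# Example, using the stop and start lines from the log above:
#   elapsed 00:39:05 00:39:35
```

If the timeout is indeed the cause, the server never gets the recovery time the comment in the config asks for.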

Why isn’t the process running? Here’s a selection from the other log:

2020/08/29-00:39:38 App::Phoebe (type Net::Server::Fork) starting! pid(19496)
2020/08/29-00:39:38 Cannot connect to SSL port 1965 on 178.209.50.237 [Address already in use]
2020/08/29-00:44:43 App::Phoebe (type Net::Server::Fork) starting! pid(20808)
2020/08/29-00:44:43 Cannot connect to SSL port 1965 on 178.209.50.237 [Address already in use]
2020/08/29-00:49:59 App::Phoebe (type Net::Server::Fork) starting! pid(22125)
2020/08/29-00:49:59 Cannot connect to SSL port 1965 on 178.209.50.237 [Address already in use]
2020/08/29-00:55:05 App::Phoebe (type Net::Server::Fork) starting! pid(23449)
2020/08/29-00:55:05 Cannot connect to SSL port 1965 on 178.209.50.237 [Address already in use]
2020/08/29-01:00:15 App::Phoebe (type Net::Server::Fork) starting! pid(27704)
2020/08/29-01:00:15 Cannot connect to SSL port 1965 on 178.209.50.237 [Address already in use]
2020/08/29-01:05:21 App::Phoebe (type Net::Server::Fork) starting! pid(15897)
2020/08/29-01:05:21 Cannot connect to SSL port 1965 on 178.209.50.237 [Address already in use]

When I returned to the server this morning, that was the state it was in, and when I tried to restart it, I got the same problem. There was a process still running, but the PID file was gone, so Monit couldn’t stop it. And although that process was still holding on to the port, it wasn’t serving any requests on it.

ps aux | grep phoebe

So what’s the best solution here? I’m thinking of a variant of “killall”, perhaps? The example below uses “[p]hoebe” to keep the grep command from listing itself.

ps aux | grep '[p]hoebe' | awk '{print $2}' | xargs kill
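
A somewhat gentler variant of the same idea, as a sketch: take a list of PIDs, send SIGTERM first so the processes get a chance to clean up (SIGKILL can’t be caught, so a killed parent never gets to tidy up after itself), and only fall back to SIGKILL for whatever is still alive after a grace period. The function name terminate_pids and the grace period are made up for this example, and whether Phoebe shuts down cleanly on SIGTERM is an assumption.

```shell
# Usage: terminate_pids GRACE_SECONDS PID...
# Ask each PID to exit with SIGTERM, wait up to GRACE_SECONDS,
# then SIGKILL whatever is still around.
terminate_pids() (
    grace=$1
    shift
    [ $# -gt 0 ] || exit 0      # nothing matched, nothing to do
    kill -s TERM "$@" 2>/dev/null
    while [ "$grace" -gt 0 ]; do
        alive=0
        for pid in "$@"; do
            kill -0 "$pid" 2>/dev/null && alive=1
        done
        [ "$alive" -eq 1 ] || exit 0    # everything is gone already
        sleep 1
        grace=$((grace - 1))
    done
    kill -s KILL "$@" 2>/dev/null
    exit 0
)

# Example (pgrep comes from procps, like ps; the bracket keeps the
# pattern from matching a wrapper shell's own command line):
#   terminate_pids 10 $(pgrep -f '/home/alex/farm/[p]hoebe')
```

Unlike the xargs pipeline, this does nothing at all when no process matches.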

What do you think?

​#Monit ​#Phoebe ​#Gemini ​#Administration