2019-07-04 Lost contact with the machine

A few minutes ago I noticed that I couldn’t reach my website. I tried to connect via SSH but got a timeout. I tried to start a VNC server via the admin console but got an error. I figured it must have gone totally bonkers and rebooted it, after more than 450 days of uptime. Now that it’s back up, I’m checking stuff but it looks like that wasn’t the problem. Time to look at the log files…

CPU looks OK

Load looks OK

Time to figure out when this happened! Let’s zoom into the very end, when Apache reports no more hits:

Apache logs showing a drop to zero after 9:00

Let’s look at the time between 9:00 and 9:10 in the logfile!

Jul  4 09:00:01 sibirocobombus CRON[13112]: (alex) CMD (/home/alex/bin/trunk-bot)
Jul  4 09:00:01 sibirocobombus CRON[13113]: (munin) CMD (if [ -x /usr/bin/munin-cron ]; then /usr/bin/munin-cron; fi)
Jul  4 09:00:01 sibirocobombus CRON[13114]: (root) CMD (if [ -x /etc/munin/plugins/apt_all ]; then /etc/munin/plugins/apt_all update 7200 12 >/dev/
null; elif [ -x /etc/munin/plugins/apt ]; then /etc/munin/plugins/apt update 7200 12 >/dev/null; fi)
Jul  4 09:00:01 sibirocobombus CRON[13115]: (root) CMD (   test -x /usr/sbin/tigercron && { [ -r "$DEFAULT" ] && . "$DEFAULT" ; nice -n$NICETIGER /
usr/sbin/tigercron -q ; })
Jul  4 09:02:01 sibirocobombus CRON[13592]: (logcheck) CMD (   if [ -x /usr/sbin/logcheck ]; then nice -n10 /usr/sbin/logcheck; fi)
Jul  4 09:05:01 sibirocobombus CRON[13627]: (munin) CMD (if [ -x /usr/bin/munin-cron ]; then /usr/bin/munin-cron; fi)
Jul  4 09:05:01 sibirocobombus CRON[13628]: (root) CMD (if [ -x /etc/munin/plugins/apt_all ]; then /etc/munin/plugins/apt_all update 7200 12 >/dev/
null; elif [ -x /etc/munin/plugins/apt ]; then /etc/munin/plugins/apt update 7200 12 >/dev/null; fi)
Jul  4 09:06:01 sibirocobombus CRON[13938]: (alex) CMD (/home/alex/bin/trunk-bot)
Jul  4 09:07:01 sibirocobombus CRON[13940]: (alex) CMD (/home/alex/bin/planet-indie)
Jul  4 09:10:01 sibirocobombus CRON[13961]: (munin) CMD (if [ -x /usr/bin/munin-cron ]; then /usr/bin/munin-cron; fi)
Jul  4 09:10:01 sibirocobombus CRON[13962]: (root) CMD (if [ -x /etc/munin/plugins/apt_all ]; then /etc/munin/plugins/apt_all update 7200 12 >/dev/null; elif [ -x /etc/munin/plugins/apt ]; then /etc/munin/plugins/apt update 7200 12 >/dev/null; fi)

This is extremely and reassuringly boring. Nothing at all seems to have happened. Happy cron jobs!

I rebooted about ten minutes later:

Jul  4 09:18:49 sibirocobombus kernel: [    0.000000] ...

@fitheach asked about that spike around midnight. I remember seeing those on a regular basis and never bothered to look into it. Let’s check now!

@fitheach

Jul  3 23:55:01 sibirocobombus CRON[24281]: (root) CMD (if [ -x /etc/munin/plugins/apt_all ]; then /etc/munin/plugins/apt_all update 7200 12 >/dev/
null; elif [ -x /etc/munin/plugins/apt ]; then /etc/munin/plugins/apt update 7200 12 >/dev/null; fi)
Jul  3 23:55:01 sibirocobombus CRON[24282]: (munin) CMD (if [ -x /usr/bin/munin-cron ]; then /usr/bin/munin-cron; fi)
Jul  3 23:57:25 sibirocobombus systemd[1]: Created slice User Slice of git.
Jul  3 23:57:25 sibirocobombus systemd[1]: Starting User Manager for UID 1001...
Jul  3 23:57:25 sibirocobombus systemd[1]: Started Session 394503 of user git.
Jul  3 23:57:25 sibirocobombus systemd[24851]: Reached target Timers.
Jul  3 23:57:25 sibirocobombus systemd[24851]: Reached target Paths.
Jul  3 23:57:25 sibirocobombus systemd[24851]: Listening on GnuPG cryptographic agent (access for web browsers).
Jul  3 23:57:25 sibirocobombus systemd[24851]: Listening on GnuPG cryptographic agent and passphrase cache (restricted).
Jul  3 23:57:25 sibirocobombus systemd[24851]: Listening on GnuPG cryptographic agent and passphrase cache.
Jul  3 23:57:25 sibirocobombus systemd[24851]: Listening on GnuPG cryptographic agent (ssh-agent emulation).
Jul  3 23:57:25 sibirocobombus systemd[24851]: Listening on GnuPG network certificate management daemon.
Jul  3 23:57:25 sibirocobombus systemd[24851]: Reached target Sockets.
Jul  3 23:57:25 sibirocobombus systemd[24851]: Reached target Basic System.
Jul  3 23:57:25 sibirocobombus systemd[24851]: Reached target Default.
Jul  3 23:57:25 sibirocobombus systemd[24851]: Startup finished in 50ms.
Jul  3 23:57:25 sibirocobombus systemd[1]: Started User Manager for UID 1001.
Jul  3 23:57:25 sibirocobombus systemd[1]: Stopping User Manager for UID 1001...
Jul  3 23:57:25 sibirocobombus systemd[24851]: Stopped target Default.
Jul  3 23:57:25 sibirocobombus systemd[24851]: Stopped target Basic System.
Jul  3 23:57:25 sibirocobombus systemd[24851]: Stopped target Timers.
Jul  3 23:57:25 sibirocobombus systemd[24851]: Stopped target Paths.
Jul  3 23:57:25 sibirocobombus systemd[24851]: Stopped target Sockets.
Jul  3 23:57:25 sibirocobombus systemd[24851]: Closed GnuPG cryptographic agent and passphrase cache.
Jul  3 23:57:25 sibirocobombus systemd[24851]: Closed GnuPG cryptographic agent (ssh-agent emulation).
Jul  3 23:57:25 sibirocobombus systemd[24851]: Closed GnuPG cryptographic agent and passphrase cache (restricted).
Jul  3 23:57:25 sibirocobombus systemd[24851]: Closed GnuPG cryptographic agent (access for web browsers).
Jul  3 23:57:25 sibirocobombus systemd[24851]: Closed GnuPG network certificate management daemon.
Jul  3 23:57:25 sibirocobombus systemd[24851]: Reached target Shutdown.
Jul  3 23:57:25 sibirocobombus systemd[24851]: Starting Exit the Session...
Jul  3 23:57:25 sibirocobombus systemd[24851]: Received SIGRTMIN+24 from PID 24867 (kill).
Jul  3 23:57:25 sibirocobombus systemd[1]: Stopped User Manager for UID 1001.
Jul  3 23:57:25 sibirocobombus systemd[1]: Removed slice User Slice of git.
Jul  3 23:57:51 sibirocobombus systemd[1]: Started Session 394505 of user alex.
Jul  3 23:57:59 sibirocobombus systemd[1]: Created slice User Slice of git.
Jul  3 23:57:59 sibirocobombus systemd[1]: Starting User Manager for UID 1001...
Jul  3 23:57:59 sibirocobombus systemd[24942]: Listening on GnuPG cryptographic agent and passphrase cache.
Jul  3 23:57:59 sibirocobombus systemd[24942]: Listening on GnuPG network certificate management daemon.
Jul  3 23:57:59 sibirocobombus systemd[24942]: Listening on GnuPG cryptographic agent and passphrase cache (restricted).
Jul  3 23:57:59 sibirocobombus systemd[24942]: Reached target Timers.
Jul  3 23:57:59 sibirocobombus systemd[24942]: Listening on GnuPG cryptographic agent (ssh-agent emulation).
Jul  3 23:57:59 sibirocobombus systemd[24942]: Listening on GnuPG cryptographic agent (access for web browsers).
Jul  3 23:57:59 sibirocobombus systemd[24942]: Reached target Sockets.
Jul  3 23:57:59 sibirocobombus systemd[24942]: Reached target Paths.
Jul  3 23:57:59 sibirocobombus systemd[24942]: Reached target Basic System.
Jul  3 23:57:59 sibirocobombus systemd[24942]: Reached target Default.
Jul  3 23:57:59 sibirocobombus systemd[24942]: Startup finished in 25ms.
Jul  3 23:57:59 sibirocobombus systemd[1]: Started User Manager for UID 1001.
Jul  3 23:57:59 sibirocobombus systemd[1]: Started Session 394506 of user git.
Jul  3 23:58:00 sibirocobombus systemd[1]: Stopping User Manager for UID 1001...
Jul  3 23:58:00 sibirocobombus systemd[24942]: Stopped target Default.
Jul  3 23:58:00 sibirocobombus systemd[24942]: Stopped target Basic System.
Jul  3 23:58:00 sibirocobombus systemd[24942]: Stopped target Timers.
Jul  3 23:58:00 sibirocobombus systemd[24942]: Stopped target Paths.
Jul  3 23:58:00 sibirocobombus systemd[24942]: Stopped target Sockets.
Jul  3 23:58:00 sibirocobombus systemd[24942]: Closed GnuPG cryptographic agent and passphrase cache.
Jul  3 23:58:00 sibirocobombus systemd[24942]: Closed GnuPG cryptographic agent (access for web browsers).
Jul  3 23:58:00 sibirocobombus systemd[24942]: Closed GnuPG cryptographic agent (ssh-agent emulation).
Jul  3 23:58:00 sibirocobombus systemd[24942]: Closed GnuPG network certificate management daemon.
Jul  3 23:58:00 sibirocobombus systemd[24942]: Closed GnuPG cryptographic agent and passphrase cache (restricted).
Jul  3 23:58:00 sibirocobombus systemd[24942]: Reached target Shutdown.
Jul  3 23:58:00 sibirocobombus systemd[24942]: Starting Exit the Session...
Jul  3 23:58:00 sibirocobombus systemd[24942]: Received SIGRTMIN+24 from PID 24957 (kill).
Jul  3 23:58:00 sibirocobombus systemd[1]: Stopped User Manager for UID 1001.
Jul  3 23:58:00 sibirocobombus systemd[1]: Removed slice User Slice of git.
Jul  4 00:00:01 sibirocobombus CRON[25130]: (root) CMD (if [ -x /etc/munin/plugins/apt_all ]; then /etc/munin/plugins/apt_all update 7200 12 >/dev/null; elif [ -x /etc/munin/plugins/apt ]; then /etc/munin/plugins/apt update 7200 12 >/dev/null; fi)
Jul  4 00:00:01 sibirocobombus CRON[25131]: (alex) CMD (/home/alex/bin/trunk-bot)
Jul  4 00:00:01 sibirocobombus CRON[25133]: (munin) CMD (if [ -x /usr/bin/munin-cron ]; then /usr/bin/munin-cron; fi)
Jul  4 00:00:01 sibirocobombus CRON[25132]: (root) CMD (   test -x /usr/sbin/tigercron && { [ -r "$DEFAULT" ] && . "$DEFAULT" ; nice -n$NICETIGER /usr/sbin/tigercron -q ; })
Jul  4 00:00:08 sibirocobombus crontab[26499]: (root) LIST (nobody)
Jul  4 00:02:01 sibirocobombus CRON[29160]: (logcheck) CMD (   if [ -x /usr/sbin/logcheck ]; then nice -n10 /usr/sbin/logcheck; fi)
Jul  4 00:05:01 sibirocobombus CRON[29266]: (root) CMD (if [ -x /etc/munin/plugins/apt_all ]; then /etc/munin/plugins/apt_all update 7200 12 >/dev/null; elif [ -x /etc/munin/plugins/apt ]; then /etc/munin/plugins/apt update 7200 12 >/dev/null; fi)
Jul  4 00:05:01 sibirocobombus CRON[29267]: (munin) CMD (if [ -x /usr/bin/munin-cron ]; then /usr/bin/munin-cron; fi)
Jul  4 00:06:01 sibirocobombus CRON[29598]: (alex) CMD (/home/alex/bin/trunk-bot)
Jul  4 00:07:01 sibirocobombus CRON[29644]: (alex) CMD (/home/alex/bin/planet-osr)

OK, so shortly before midnight, there’s some `systemd` activity, but nothing stands out, I think?

For the moment, there doesn’t seem to be anything I can learn from the experience.

1. Apache was unreachable

2. SSH daemon was unreachable

I didn’t check ping, nor the Gopher services. Perhaps I should have, before rebooting.

Log files are looking good. Monitoring seems to have worked just fine.

I have no idea what happened.

As you can see in the logs above I have some bots running via cron jobs. Both of them mailed me error messages.

At 09:08 and at 09:14, when the system was unreachable, I got the error “socket.gaierror: [Errno -3] Temporary failure in name resolution.”

At 10:00, after the reboot, I got a message from Tiger:

# Checking listening processes
OLD: --WARN-- [lin003w] The process `systemd-resolve' is listening on socket 53 (UDP on 127.0.0.53 interface) is run by systemd-resolve.
OLD: --WARN-- [lin003w] The process `systemd-resolve' is listening on socket 5355 (TCP on every interface) is run by systemd-resolve.
OLD: --WARN-- [lin003w] The process `systemd-resolve' is listening on socket 5355 (UDP on every interface) is run by systemd-resolve.

So... what happened? I’m skimming `journalctl -r` but that just starts with the boot and `journalctl --list-boots` shows I have just this one to go by. Sad! I guess we’ll never know...

​#Administration

Comments

(Please contact me if you want to remove your comment.)

I’m not sure we’re done, here. I just tried contacting websites and was getting proxy errors: Apache was unable to contact the Hypnotoad services!

Monit reported all of them up and running. Munin reported load average around 0.6. What the hell?

1. I added a ton of IP numbers with suspicious activity in my access log to the black list

2. I restarted all the Hypnotoads

Let’s see what happens.

– Alex Schroeder 2019-07-07 14:14 UTC

---

Same thing, today! I just had to reload all the Hypnotoads. When it’s happening, I’m getting 502 Proxy errors and cannot connect to my sites via Apache, and I can’t connect to the same sites directly via HTTP. This includes the Mojolicious::Plugin::Status page. Thus, the plugin is currently useless for debugging this.

Mojolicious::Plugin::Status

1. CPU% is not suspicious

2. Load is not suspicious

3. Apache hits are not suspicious

Let’s take a look at the Apache result codes in 10min time groupings:

tail -n 10000 /var/log/apache2/access.log | /home/alex/bin/time-grouping 10 | cut -c 1-80
   08/Jul/2019:11:30        581     5%  200 (62%), 403 (21%), 301 (6%), 408 (3%)
   08/Jul/2019:11:20        885     8%  200 (67%), 301 (16%), 304 (6%), 404 (5%)
   08/Jul/2019:11:10       1265    12%  200 (36%), 502 (29%), 403 (19%), 301 (8%
   08/Jul/2019:11:00        721     7%  200 (75%), 301 (11%), 408 (3%), 304 (3%)
   08/Jul/2019:10:50        798     7%  200 (79%), 301 (11%), 304 (3%), 404 (2%)
   08/Jul/2019:10:40        780     7%  200 (74%), 301 (15%), 404 (3%), 304 (2%)
   08/Jul/2019:10:30        701     7%  200 (72%), 301 (13%), 408 (5%), 304 (3%)
   08/Jul/2019:10:20        757     7%  200 (72%), 301 (16%), 304 (4%), 408 (3%)
   08/Jul/2019:10:10        622     6%  200 (68%), 301 (17%), 408 (4%), 404 (4%)
   08/Jul/2019:10:00        876     8%  200 (75%), 301 (9%), 403 (5%), 408 (4%),
   08/Jul/2019:09:50        701     7%  200 (76%), 301 (11%), 404 (3%), 304 (3%)
   08/Jul/2019:09:40        790     7%  200 (76%), 301 (10%), 408 (4%), 304 (4%)
   08/Jul/2019:09:30        523     5%  200 (78%), 301 (10%), 408 (4%), 404 (3%)

Clearly, at 11:10 the requests nearly double and about a third (29%) result in 502 (proxy error) codes. But there are also 19% resulting in 403 (forbidden). The typical 200 (ok) results drop significantly.

New approach!

I’m going to use the rules listed in Stop fake and bad user agents with htaccess. Let’s hope this works! 🙂

Stop fake and bad user agents with htaccess

– Alex Schroeder 2019-07-08 10:28 UTC

---

And it just happened again:

time-grouping | cut -c 1-80
   08/Jul/2019:13:20        122     1%  200 (60%), 403 (18%), 301 (13%), 404 (3%
   08/Jul/2019:13:10        737     7%  200 (75%), 301 (12%), 403 (4%), 404 (3%)
→  08/Jul/2019:13:00        615     6%  502 (44%), 200 (29%), 301 (16%), 403 (3%
→  08/Jul/2019:12:50        642     6%  200 (37%), 502 (34%), 301 (15%), 403 (5%
   08/Jul/2019:12:40        719     7%  200 (68%), 301 (14%), 403 (5%), 408 (4%)
   08/Jul/2019:12:30        774     7%  200 (70%), 301 (13%), 403 (6%), 408 (3%)
   08/Jul/2019:12:20        665     6%  200 (70%), 301 (12%), 304 (7%), 404 (3%)
   08/Jul/2019:12:10        763     7%  200 (73%), 301 (12%), 304 (6%), 404 (3%)
   08/Jul/2019:12:00        848     8%  200 (76%), 301 (12%), 304 (4%), 404 (2%)
   08/Jul/2019:11:50        887     8%  200 (75%), 301 (6%), 304 (6%), 404 (3%),
   08/Jul/2019:11:40        862     8%  200 (78%), 301 (8%), 304 (6%), 408 (3%),
   08/Jul/2019:11:30        809     8%  200 (66%), 403 (15%), 301 (7%), 304 (4%)
   08/Jul/2019:11:20        885     8%  200 (67%), 301 (16%), 304 (6%), 404 (5%)
   08/Jul/2019:11:10        672     6%  200 (41%), 403 (22%), 502 (21%), 301 (8%

I’m out of ideas... I’ll try changing all the Hypnotoad configurations and increase the number of workers per application from 1 to 2.

– Alex Schroeder 2019-07-08 11:20 UTC