Mirabel

I'm trying to figure out what causes the occasional load problems on mirabel.epfarms.org. Some days the websites I host there (the most important ones are campaignwiki.org and communitywiki.org) will not work. Why? I have no idea. This is a Debian shared hosting system.

php-cgi

Sometimes one of the sites gets hit by spammers. You’ll notice this by seeing a username you have never seen before running a lot of php-cgi processes.
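A quick way to spot such a user is to count php-cgi processes per username. The following is only a sketch: the function name `count_php_cgi` is made up for this page, and it assumes the process name is `php-cgi` or a versioned variant such as `php5-cgi`.

```shell
#!/bin/sh
# Hypothetical helper: count php-cgi processes per user.
# Reads "USER COMMAND" lines (as printed by `ps -eo user,comm`) on stdin
# and prints "COUNT USER" lines, busiest user first.
count_php_cgi() {
    # Assumption: the binary is called php-cgi, php5-cgi or similar.
    awk '$2 ~ /^php[0-9.]*-cgi$/ { n[$1]++ }
         END { for (u in n) print n[u], u }' |
        sort -k1,1nr -k2,2
}

# Usage on a live system:
#   ps -eo user,comm | count_php_cgi
```

An unfamiliar username at the top of that list with a large count is the one to look at.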

To disable PHP for a particular site, create or update the `.htaccess` file in their htdocs with the following:

php_flag engine off

Sometimes these people run multiple sites. Here’s how to show their environment.

show their environment

2011 and earlier

When I connect via ssh, I like to run `top`. This is what I see minutes after a server restart:

top - 17:32:24 up 56 days,  2:38,  6 users,  load average: 85.19, 44.79, 64.64
Tasks: 273 total,   2 running, 262 sleeping,   0 stopped,   9 zombie
Cpu(s):  4.3%us,  3.4%sy,  0.0%ni,  0.0%id, 92.0%wa,  0.0%hi,  0.3%si,  0.0%st
Mem:   2062832k total,  2043120k used,    19712k free,     5792k buffers
Swap:  8388600k total,  3274276k used,  5114324k free,    35860k cached

Load is shooting up into the air again. Practically no processes are running (2 of 273 tasks), yet 92% of CPU time goes to I/O wait (`wa`). There’s a lot of *waiting* going on.
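On Linux, the load average counts not only runnable processes but also processes in uninterruptible sleep (state `D`, usually waiting for disk I/O), which is how load can be this high with almost nothing running. A minimal sketch for listing those processes follows; the function name `blocked_procs` is made up for this page.

```shell
#!/bin/sh
# Hypothetical helper: show processes in uninterruptible sleep.
# Reads `ps -eo state,pid,user,comm` output on stdin and prints
# only the lines whose one-letter state is D.
blocked_procs() {
    awk 'NR > 1 && $1 == "D" { print }'
}

# Usage on a live system:
#   ps -eo state,pid,user,comm | blocked_procs
```

If the same few commands keep showing up in this list, they are the ones stuck waiting on I/O.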

The admin tells me the RAID is OK. This is not about failing disks.

alex@mirabel:~$ sudo mdadm --detail /dev/md0
/dev/md0:
        Version : 00.90.03
  Creation Time : Mon Mar 24 03:42:25 2008
     Raid Level : raid5
     Array Size : 180153984 (171.81 GiB 184.48 GB)
  Used Dev Size : 60051328 (57.27 GiB 61.49 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Sun May 16 17:57:40 2010
          State : active
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 64K

           UUID : dd3b4aab:46063cd1:e368bf24:bd0fce41
         Events : 0.48049

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       8       17        1      active sync   /dev/sdb1
       2       8       33        2      active sync   /dev/sdc1
       3       8       49        3      active sync   /dev/sdd1

Another friend told me to look at stalling HTTP requests.

alex@mirabel:~$ sudo lsof -i :80|wc -l
165

But how to fix it?

Twice now the admin said that the problem was the system running out of disk space. I think I checked it the last time around using `df -h` and `df -i` but didn’t notice anything unusual.

A few days later, we had the same situation: `load average: 151.56`. This time `sudo lsof -i :80|wc -l` returned 9, so stalling HTTP requests are not the cause. And `df -h`, which I wanted to study the last time, says:

Filesystem            Size  Used Avail Use% Mounted on
/dev/md0              171G  140G   23G  86% /
varrun               1008M  120K 1008M   1% /var/run
varlock              1008M     0 1008M   0% /var/lock
udev                 1008M   60K 1008M   1% /dev
devshm               1008M     0 1008M   0% /dev/shm
lrm                  1008M   45M  964M   5% /lib/modules/2.6.24-24-server/volatile
/dev/sdd4             8.2G  221M  7.6G   3% /boot
tmpfs                1008M   45M  964M   5% /lib/modules/2.6.24-27-server/volatile

I don’t see a device running out of disk space.
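Instead of reading the table by eye, the check could be scripted. This is only a sketch: the function name `over_threshold` and the 85% limit are made up for this page, and it assumes the single-line POSIX layout of `df -P`, where the fifth column is the percentage and the sixth is the mount point (mount points containing spaces would be cut short).

```shell
#!/bin/sh
# Hypothetical helper: report filesystems at or above a usage limit.
# Reads `df -P` output on stdin; $1 is the limit in percent.
# Prints "MOUNTPOINT PERCENT" for every filesystem over the limit.
over_threshold() {
    awk -v limit="$1" 'NR > 1 {
        pct = $5
        sub(/%/, "", pct)           # "86%" becomes "86"
        if (pct + 0 >= limit) print $6, $5
    }'
}

# Usage on a live system:
#   df -P | over_threshold 85
```

With the numbers above, only `/` at 86% would be reported for a limit of 85, which is high but not full.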

There aren’t many processes running, either:

top - 12:50:21 up 58 days, 21:56,  5 users,  load average: 63.10, 106.94, 103.96
Tasks: 109 total,   2 running, 103 sleeping,   0 stopped,   4 zombie
Cpu(s):  0.3%us,  1.0%sy,  0.0%ni, 49.5%id, 49.2%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   2062832k total,   303556k used,  1759276k free,     4780k buffers
Swap:  8388600k total,  2850064k used,  5538536k free,    40736k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
32309 root      20   0  158m  16m 3168 D    2  0.8   0:00.22 apache2
21657 mysql     18  -2 3720m  59m 2340 S    0  2.9 234:30.98 mysqld
32211 alex      20   0 18992 1392  936 R    0  0.1   0:00.44 top
    1 root      20   0  4020  140   64 S    0  0.0   0:18.47 init
    2 root      15  -5     0    0    0 S    0  0.0   0:00.05 kthreadd
    3 root      RT  -5     0    0    0 S    0  0.0   0:28.23 migration/0
    4 root      15  -5     0    0    0 S    0  0.0   0:56.75 ksoftirqd/0
    5 root      RT  -5     0    0    0 S    0  0.0   0:01.36 watchdog/0
    6 root      RT  -5     0    0    0 S    0  0.0   0:50.89 migration/1
    7 root      15  -5     0    0    0 S    0  0.0   0:28.72 ksoftirqd/1
    8 root      RT  -5     0    0    0 S    0  0.0   0:01.06 watchdog/1
    9 root      15  -5     0    0    0 S    0  0.0   1:38.27 events/0
   10 root      15  -5     0    0    0 S    0  0.0   2:06.44 events/1
   11 root      15  -5     0    0    0 S    0  0.0   0:00.01 khelper
   44 root      15  -5     0    0    0 S    0  0.0   0:10.51 kblockd/0
   45 root      15  -5     0    0    0 S    0  0.0   0:51.72 kblockd/1
   48 root      15  -5     0    0    0 S    0  0.0   0:00.00 kacpid
   49 root      15  -5     0    0    0 S    0  0.0   0:00.00 kacpi_notify
  132 root      15  -5     0    0    0 S    0  0.0   0:00.00 kseriod
  184 root      15  -5     0    0    0 S    0  0.0  18:39.70 kswapd0
  227 root      15  -5     0    0    0 S    0  0.0   0:00.00 aio/0
  228 root      15  -5     0    0    0 S    0  0.0   0:00.00 aio/1
 1442 root      15  -5     0    0    0 S    0  0.0   0:00.00 ksuspend_usbd
 1445 root      15  -5     0    0    0 S    0  0.0   0:00.00 khubd
 1472 root      15  -5     0    0    0 S    0  0.0   0:00.00 ata/0
 1474 root      15  -5     0    0    0 S    0  0.0   0:00.00 ata/1
etc.

Hm, strange. Load seems to be dropping fast! `load average: 13.22, 77.95, 93.90` – I wonder what’s going on!

Restarting stuff on the go…

Must remember something like the following:

Comments

(Please contact me if you want to remove your comment.)

Hopefully all is back and working just fine once more. Thanks for your persistence, Alex :D

– greywulf 2010-05-17 05:52 UTC

---

The problem with virtualized hosts is that it can be other virtual hosts (or the machine hosting the virtual hosts) causing the load and then you have no way to control your virtual machines’ behavior at all.

– Harald Wagener 2010-05-17 13:12 UTC
