OpenBSD Debugging

Complicated systems tend to be complicated to debug. There are prerequisites, for starters the fact that it will be difficult to debug a system you know little to nothing about. Humility is in order; it's okay to admit you have no clue where the problem is. Skepticism is also welcome; too often one might blame the database, because it's always the database, no it cannot be my application, I'm sure it is not my application, I've already checked that, so obviously it cannot be the application. (It was the application.)

"Advanced Programming in the Unix Environment" is recommended to learn how a unix system works, though there are other texts that cover specific unix systems or the art of systems administration. Another option might be to work at a consulting firm where there are others you can kick an issue over to that you are stuck on, or at least lots of work to be had. "Learn by doing" suggests that one sets up and breaks systems, a lot. This is fairly easy with virtual systems. Ancient hardware is still an option, especially if you want to break the bootloader on a real system in various ways.

Probably you want configuration management of some sort, so the system can be setup in a particular way with lower odds of something going awry. Here are two simple ones to consider:

https://github.com/rollcat/judo

https://github.com/eradman/rset

Methods

Situation

Since web things seem popular and can involve complexity--proxies, daemons, protocols, chroots--let us inspect a buggy web system. This is a simple system as far as some web setups go, as OpenBSD 7.2 lacks containers and cloud orchestration frameworks. Still, it hopefully should be educational.

First we must define what our system is supposed to do; just as it is hard to complete a project without the requirements, it is difficult to debug something if one does not know what the correct state should be. The metaquestion or question about the question of whether the system even makes sense is not considered here, though could be important to address in real-world systems.

Bueno. Our service will be simple though of complicated implementation, as we desire something that is easy to understand but difficult to debug: a virtual Rube Goldberg machine. This service, given user input, will return that input rotated by 13 characters. Few additional details will be necessary; let us say our service accepts a request of /cipher/plaintext where the plaintext is composed of ASCII [a-zA-Z] and returns the ciphertext as text/plain in the body of the HTTP response.

And of course it does not work.

    $ ftp http://example.org/cipher/caesar
    ...
    ftp: connect: Permission denied

Generally lower levels should be first established as working. There is little reason to debug software issues when a system is powered off, for example. Users may still report it as a software issue. Monitoring or a dashboard may help here--is DNS responding within allowed latency? Are health checks failing? Otherwise, basic unix sysadmin skills apply. We'll assume that the system is actually up and running.

    $ pgrep http
    $ netstat -lnp tcp
    Active Internet connections (only servers)
    Proto   Recv-Q Send-Q  Local Address          Foreign Address        TCP-State
    tcp          0      0  *.22                   *.*                    LISTEN
    Active Internet connections (only servers)
    Proto   Recv-Q Send-Q  Local Address          Foreign Address        TCP-State
    tcp6         0      0  *.22                   *.*                    LISTEN

We probably should turn on the web thing.

    $ doas rcctl start httpd
    httpd(ok)
    $ pgrep http | wc -l
           5

Mission accomplished!

Nope, says the customer, still down:

    $ ftp http://example.org/cipher/caesar
    ...
    ftp: connect: Permission denied
    $ curl http://example.org/cipher/asdf
    curl: (7) Couldn't connect to server

At least they also tried a different client; the issue could be due to a specific web client doing something stupid, or perhaps a particular server is incompatible with a particular client because it hates the User-Agent, or TLS cannot negotiate, etc. Can you devise a test that would help determine where the problem is? The trade-off here is too much time spent testing versus having tried enough different things to help outline what the actual problem is.

The next obvious thing to check is the firewall, or firewalls. There could be multiple firewalls involved, though outbound firewalls tend to be pretty rare. Here we might find a rule blocking the request; real firewall rules can be much more complicated and thus more difficult to reason about and debug.

    block return-icmp ( 3, 3 ) proto tcp from any to any port 80

Perhaps this bad firewall rule prevented a health check from working and then something stopped trying to restart the httpd service after too many errors? A real-life example of this would be a lot of NetworkManager instances giving up before a network switch completed its health checks after a power loss: two failures for the price of one cause. Problems often multiply in this manner.

Note that multiple changes have been made to the system, first to turn on the httpd service, and then to fix the firewall rules. This can complicate debugging if it turns out that httpd should not have been enabled. Maybe the host uses a different web server, and now the firewall rules are fixed but the wrong web server is running. This is where documentation or configuration management can help: host X is special--it runs the custom software Y unlike the Z used on all the others. Humans are good at forgetting about special cases or unexpected changes.

And after correcting the firewall:

    ...
    $ doas pfctl -f /etc/pf.conf
    $ ftp http://example.org/cipher/caesar
    ...
    ftp: connect: Connection refused

Progress, as measured by the rate at which new error messages are encountered. Note that different failures can result in the same error message, sometimes. Where is httpd listening? We forgot to check that.

    $ fstat -p `pgrep -f httpd:\ server | sed 1q` | grep tcp
    www      httpd      10175    6* internet6 stream tcp 0x0 *:8080
    www      httpd      10175    7* internet stream tcp 0x0 *:8080

So, on a high port. Presumably this is normal, which probably means something else redirects requests to the high port--a firewall, or a proxy. Here we could branch and either test a :8080 URL, or delve into what if anything is redirecting the requests from :80 to :8080. Debugging is never so linear as this sort of document presents; it is more like a maze of dead ends, or untangling a ball of string by tugging first here and then there and then back on here again.

Let's fix port 80 first. Once that is fixed our customer will be more easily able to test the real URL, and is less likely to end up with an incorrect :8080 link cached somewhere. Users love to share inappropriate internal links; best not to let them leak. Also an entire site is probably more important to fix before we get to specific resources within that site. Yep, the whole site is down. Or at least also the root page.

    $ ftp -o - http://example.org/
    ...
    ftp: connect: Connection refused

Monitoring results collected into a dashboard can help show what is down and therefore what is most important to work on, rather than discovering these facts slowly through manual exploration or user complaints. Still, manual discovery has it's place: it's a fundamental skill, and how do you know what you need to put into monitoring without manually discovering the system? Ask the departed developers? Read the unwritten documentation?

I wonder what services aren't running.

    $ doas rcctl ls failed
    relayd

Keeping things running is not something that rc.d does.

    $ doas rcctl restart relayd
    relayd(ok)
    $ ftp -o - http://example.org 2>/dev/null
    this, is a test website
    $ ftp -o - http://example.org/cipher/caesar
    ...
    ftp: Receiving HTTP reply: Operation now in progress

A glance at the httpd.conf file might be good, to see what the website or in particular the "cipher" path requires.

        location  match "^/cipher/" {
                fastcgi socket tcp 127.0.0.1 1235
        }

Nothing was running on that port, according to the earlier netstat(1). What is likely is that someone setup the system manually, or without integrating the fastcgi process into the standard rc.d setup. This makes a system quick to setup, but more difficult to debug, especially if you are trying to clean up after something threw something together.

Here we are looking for something that (probably) mentions port 1235, though sometimes folks use the /etc/services string names instead of numbers. There is no service name for 1235, so there should be a literal 1235 somewhere. Or not.

    $ doas fgrep -rl 1235 /etc
    /etc/firmware/iwx-so-a0-gf-a0-67
    /etc/firmware/iwx-ty-a0-gf-a0-67
    /etc/moduli
    /etc/httpd.conf
    $ grep 1235 /etc/services
    $

Another common place is in crontab files, which can live outside of /etc. This is probably not an ideal place to manage a daemon from.

    $ doas grep 1235 -r /var/cron/tabs
    /var/cron/tabs/root:@reboot /usr/local/bin/spawn-fcgi -p 1235 -- /var/www/bin/cipher
    $ doas /usr/local/bin/spawn-fcgi -p 1235 -- /var/www/bin/cipher
    spawn-fcgi: child spawned successfully: PID: 2279
    $ pgrep cipher
    2279

Perhaps cron failed to run at startup for some reason and therefore did not start the fastcgi service?

    $ ftp -o - http://example.org/cipher/caesar
    ...
    ftp: Error retrieving http://example.org/cipher/caesar: 500 Internal Server Error

This is another branching point. We could try to find the source code for cipher and learn how it works, or we can treat it like a black box and try to run test inputs through it. Do you have test programs to feed test input into fastcgi or similar protocol handlers? An initial survey with file(1) and strings(1) might be good.

    $ file /var/www/bin/cipher
    /var/www/bin/cipher: ELF 64-bit LSB shared object, x86-64, version 1
    $ strings /var/www/bin/cipher
    ...
    getenv
    FCGI_printf
    chroot
    getline
    asprintf
    wait
    libfcgi.so.0.0
    libc.so.96.2
    /usr/games/rot13
    Status: 500 Internal Server Error

From this select list of strings we can assume a fastcgi process that probably performs a chroot and might try to call rot13(6) to perform the Caesar cipher. For more complicated programs, it may be harder to say exactly what they do due to all the system calls and functions compiled into them. For example, a golang FCGI program that prints "hello world" contains 39,543 strings compared to 56 strings in /var/www/bin/cipher. strings(1) however may be filtered for lines containing / to try to find configuration or other relevant filesystem paths, especially if a chroot is involved.

    $ strings /var/www/bin/cipher | fgrep /
    /usr/libexec/ld.so
    /cipher/
    Content-Type: text/plain
    /var/www
    /usr/games/rot13

ldd(1) may also relevant for things under a chroot. Notably none of the required libraries nor /var/www/usr/games/rot13 (nor libraries it needs) is present. Programs may also require various configurations files; ktrace(1) would be one way to discover those, if the program lacks good messages about what files it needs.

    $ ldd /var/www/bin/cipher
    /var/www/bin/cipher:
            Start            End              Type  Open Ref GrpRef Name
            000005a9ad46b000 000005a9ad470000 exe   1    0   0      /var/www/bin/cipher
            000005ac354dc000 000005ac354eb000 rlib  0    1   0      /usr/local/lib/libfcgi.so.0.0
            000005ac3de00000 000005ac3def4000 rlib  0    1   0      /usr/lib/libc.so.96.2
            000005ac7cab8000 000005ac7cab8000 ld.so 0    1   0      /usr/libexec/ld.so
    $ ls /var/www/usr
    ls: /var/www/usr: No such file or directory

Some commands probably should be tested with "echo" first instead of "doas". Also, spaces in the filenames will make life more complicated than presented here.

    $ ldd /var/www/bin/cipher |
      awk '$5 ~ 1{system("doas install -D " $NF " /var/www" $NF)}'
    $ doas install -D /usr/games/rot13 /var/www/usr/games/rot13
    $ ldd /usr/games/rot13 |
      awk '$5 ~ 1{system("doas install -D " $NF " /var/www" $NF)}'

And, finally, it works.

    $ ftp -o - http://example.org/cipher/caesar 2>/dev/null
    pnrfne
    $ echo caesar | /usr/games/rot13
    pnrfne

A diagram might be handy to draw, though for a service like this that is too complicated one might look instead to shut it down, or rewrite it: various improvements might be to not fork out to rot13, to make a static build so random outputs of ldd need not be copied into a chroot, etc. Other improvements might be to rewrite the @reboot cron job to be a proper service daemon, write some documentation about the service on a wiki, etc.

tags #debug #openbsd #unix