Behind The Scenes

I loved HowStuffWorks as a kid, and I love "how-to" guides, especially when they explain the rationale behind design decisions.

https://www.howstuffworks.com

Today we'll talk about the Gemini server that powers this capsule and runs on an ESP32 development board. (All code is simplified and some uninteresting parts are omitted.)

stuff.bin

stuff.bin is a binary file produced by this Python script:

import struct
import os
import sys
import zlib

blobs = []
for f in os.listdir(sys.argv[1]):
    with open(os.path.join(sys.argv[1], f), "rb") as fp:
        compo = zlib.compressobj(9, zlib.DEFLATED, -15, 9, zlib.Z_DEFAULT_STRATEGY)
        data = fp.read()
        cdata = compo.compress(data) + compo.flush()
        blobs.append((f, cdata))

with open(sys.argv[2], "wb") as fp:
    offset = len(blobs) * struct.calcsize("III") + struct.calcsize("I")
    for f, blob in blobs:
        # CRC32 of the URI path; named crc to avoid shadowing the built-in hash()
        crc = zlib.crc32(b"/" + f.encode('utf-8'))
        fp.write(struct.pack("!III", crc, offset, len(blob)))
        offset += len(blob)

    fp.write(struct.pack("I", 0))

    for _, blob in blobs:
        fp.write(blob)

The script accepts two arguments: a directory with .gmi files and an output path. It produces a binary file that consists of two parts.

The first is an array of data structures like this one:

typedef struct {
    uint32_t hash;
    uint32_t off;
    uint32_t len;
} bin_stuff_hdr_t;

Each element in the array holds the CRC32 of a URI path (for example, /index.gmi), the offset of the compressed file within stuff.bin and its compressed size, all stored as big-endian 32-bit integers. The end of this array is marked by 4 zero bytes.

The second part of the file contains concatenated, Deflate-compressed files. The compression ratio of Gemtext tends to be very good, and the difference between Deflate and LZMA is negligible.
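To make the layout concrete, here is a minimal Python sketch that reads the index back. It assumes the format described above (big-endian hash/offset/length triples, terminated by four zero bytes). pack() and parse_index() are names made up for this example, and pack() is just the script's logic applied to in-memory inputs.

```python
import struct
import zlib

ENTRY = struct.Struct("!III")  # hash, offset, compressed length (big-endian)


def pack(files):
    """Build a stuff.bin image in memory from a {name: bytes} dict."""
    blobs = []
    for name, data in files.items():
        compo = zlib.compressobj(9, zlib.DEFLATED, -15, 9, zlib.Z_DEFAULT_STRATEGY)
        blobs.append((name, compo.compress(data) + compo.flush()))

    offset = len(blobs) * ENTRY.size + 4  # index entries + 4-byte terminator
    index = b""
    for name, blob in blobs:
        crc = zlib.crc32(b"/" + name.encode("utf-8"))
        index += ENTRY.pack(crc, offset, len(blob))
        offset += len(blob)

    return index + struct.pack("!I", 0) + b"".join(blob for _, blob in blobs)


def parse_index(image):
    """Return {hash: (offset, length)} for every entry before the terminator."""
    entries = {}
    pos = 0
    while struct.unpack_from("!I", image, pos)[0] != 0:
        crc, off, length = ENTRY.unpack_from(image, pos)
        entries[crc] = (off, length)
        pos += ENTRY.size
    return entries


image = pack({"index.gmi": b"# Hello\n" * 20, "about.gmi": b"Some text\n" * 20})
index = parse_index(image)
off, length = index[zlib.crc32(b"/index.gmi")]
print(zlib.decompress(image[off:off + length], -15) == b"# Hello\n" * 20)  # True
```

One consequence of this layout: a path whose CRC32 happens to be 0 would look like the terminator, so a build script could reject such a file name.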

stuff.bin is embedded into the ESP32 application image:

idf_component_register(...
                    ...
                    EMBED_TXTFILES ... "stuff.bin")

The CMake-based build system probably uses objcopy to do that: I've used this technique in the past. objcopy(1) says:

       -B bfdarch
       --binary-architecture=bfdarch
           Useful when transforming a architecture-less input file into an
           object file. [...]
           [...]
           These symbols are called _binary_objfile_start,
           _binary_objfile_end and _binary_objfile_size.  e.g. you can
           transform a picture file into an object file and then access it in
           your code using these symbols.

Therefore, to access stuff.bin from the code, all we have to do is:

    extern const unsigned char stuff_bin_start[] asm("_binary_stuff_bin_start");
    extern const unsigned char stuff_bin_end[] asm("_binary_stuff_bin_end");

When the client requests a file, the Gemini server locates the first / after the gemini:// prefix of the requested URL. If there is no path, or the path is just /, it falls back to /index.gmi. The path is hashed and the array is used to find the offset of the requested file within stuff.bin:

    path = strchr(&request[sizeof("gemini://") - 1], '/');
    if (!path || (path[0] == '/' && !path[1])) {
        hash = htonl(mz_crc32(MZ_CRC32_INIT, (const unsigned char *)"/index.gmi", sizeof("/index.gmi") - 1));
    } else {
        hash = htonl(mz_crc32(MZ_CRC32_INIT, (const unsigned char *)path, len - (path - request)));
    }

    for (hdr = (bin_stuff_hdr_t *)stuff_bin_start; hdr->hash != 0; ++hdr) {
        if (hdr->hash == hash) {
            p = stuff_bin_start + ntohl(hdr->off);
            break;
        }
    }

At first, I used memmem() and a magic marker (something like ">>> /index.html") before each .gmi file, and it was painfully slow even with a small stuff.bin. The O(n) lookup using 32-bit checksums (where n is the number of files) is super ugly, but fast and predictable.
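The path selection and hashing can be mirrored in a few lines of Python, which is handy for checking what the server will look up for a given URL. This is only a sketch: request_path(), path_hash() and find_collisions() are hypothetical helpers, not part of the server, and the collision check is something I'd consider adding to the build script, since two paths with the same CRC32 would shadow each other.

```python
import zlib

SCHEME = "gemini://"


def request_path(url):
    """Return the path the server would hash for a request URL."""
    slash = url.find("/", len(SCHEME))  # first '/' after the scheme
    if slash == -1 or url[slash:] == "/":
        return "/index.gmi"
    return url[slash:]


def path_hash(url):
    """CRC32 of the selected path, as stored in the stuff.bin index."""
    return zlib.crc32(request_path(url).encode("utf-8"))


def find_collisions(paths):
    """Return pairs of distinct paths that share a CRC32 (ideally none)."""
    seen = {}
    clashes = []
    for path in paths:
        crc = zlib.crc32(path.encode("utf-8"))
        if crc in seen and seen[crc] != path:
            clashes.append((seen[crc], path))
        seen.setdefault(crc, path)
    return clashes


print(request_path("gemini://example.org"))            # /index.gmi
print(request_path("gemini://example.org/"))           # /index.gmi
print(request_path("gemini://example.org/about.gmi"))  # /about.gmi
```

Everything after the first / is hashed as-is, so a URL with a query string or a trailing slash hashes differently from the plain file path and will not match any entry.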

After this lookup, the Gemini server allocates a small chunk of memory, prepares for decompression and sends the status line:

    static const char ok[] = "20 text/gemini\r\n";
    static const char error[] = "50 Error\r\n";

    out = malloc(INFLATE_BUFSIZ);

    if (p >= stuff_bin_start && ntohl(hdr->len) <= stuff_bin_end - stuff_bin_start && mz_inflateInit2(&strm, -15) == MZ_OK && out) {
        if (send_all(tls, ok, sizeof(ok) - 1) != sizeof(ok) - 1) {
            ESP_LOGI(TAG, LOG_FMT("failed to send status line to socket %d: %d"), fd, chunk);
            return;
        }

Once the status line is sent, the server decompresses the file using miniz, one chunk at a time, and sends the chunk:

        strm.next_in = p;
        strm.avail_in = ntohl(hdr->len);

        int mz_ret;
        do {
            strm.next_out = out;
            strm.avail_out = INFLATE_BUFSIZ;
            mz_ret = mz_inflate(&strm, MZ_NO_FLUSH);
            if (mz_ret != MZ_OK && mz_ret != MZ_STREAM_END) break;
            if (send_all(tls, out, INFLATE_BUFSIZ - strm.avail_out) != INFLATE_BUFSIZ - strm.avail_out) break;
        } while (mz_ret != MZ_STREAM_END);

miniz is a small zlib-compatible library; the server only uses its Deflate decompressor, which is perfect for a small application:

https://github.com/richgel999/miniz

The ESP32 board doesn't have much space and the compiled size of miniz is a worthwhile investment, because this capsule's stuff.bin is 62% smaller thanks to compression. In the future, as the capsule content grows in size, maybe I'll need to compress the concatenated files together (as one continuous blob), to achieve a higher compression ratio.
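A quick way to estimate what a single blob would buy is to compress the same files both ways and compare sizes. The sketch below uses made-up sample files that share a footer, so the numbers say nothing about this capsule; deflate() and compare() are hypothetical helpers that use the same zlib settings as the packing script.

```python
import zlib


def deflate(data):
    """Raw Deflate with the same parameters as the packing script."""
    c = zlib.compressobj(9, zlib.DEFLATED, -15, 9, zlib.Z_DEFAULT_STRATEGY)
    return c.compress(data) + c.flush()


def compare(files):
    """Return (per_file_total, solid_total) compressed sizes in bytes."""
    per_file = sum(len(deflate(data)) for data in files)
    solid = len(deflate(b"".join(files)))
    return per_file, solid


# Made-up sample content: ten short posts sharing the same footer links.
footer = b"=> /index.gmi Back to the index\n=> /about.gmi About this capsule\n"
files = [
    b"# Post %d\n\nSome unique text for post %d.\n\n" % (i, i) + footer
    for i in range(10)
]
per_file, solid = compare(files)
print(per_file, solid)
```

The catch is that a Deflate stream has no random access: with one blob, the server would have to decompress everything that precedes the requested file, so the gain in flash space costs CPU time per request.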

I had to choose between chunked decompression and decompression in one go, and went with chunked decompression. I don't want the server to waste CPU cycles on decompression of the entire .gmi file if the client is slow and fails to receive the beginning of the file in a reasonable time.

The median size of .gmi files in this capsule is about 3K at the moment, and 512 bytes is a reasonable buffer size: it's small enough to reduce memory consumption but still large enough to stream the content with reasonable speed (some Gemini clients display the content during download!).
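The same chunked approach can be tried on a desktop with Python's zlib instead of miniz. This is a sketch, not the server's code: inflate_chunks() is a made-up name, and INFLATE_BUFSIZ is assumed to be the 512-byte buffer mentioned above. Each yielded chunk corresponds to one send_all() call in the C loop.

```python
import zlib

INFLATE_BUFSIZ = 512  # assumed to match the server's buffer size


def inflate_chunks(blob, bufsiz=INFLATE_BUFSIZ):
    """Yield decompressed chunks of at most bufsiz bytes from raw Deflate."""
    d = zlib.decompressobj(-15)  # -15: raw Deflate, no zlib header
    data = blob
    while True:
        chunk = d.decompress(data, bufsiz)  # produce at most bufsiz bytes
        if chunk:
            yield chunk
        data = d.unconsumed_tail  # input not consumed yet because of the cap
        if d.eof:
            break
        if not chunk and not data:
            raise ValueError("truncated Deflate stream")


page = b"=> /index.gmi Home\n" * 200  # 3800 bytes, close to the median size
c = zlib.compressobj(9, zlib.DEFLATED, -15)
blob = c.compress(page) + c.flush()

sizes = [len(chunk) for chunk in inflate_chunks(blob)]
print(len(blob), sum(sizes), max(sizes))
```

Because output is produced on demand, a client that disconnects after the first chunk costs the server one small inflate call instead of the decompression of the whole file.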

Requests

Every incoming request is handled by a separate "thread":

static void do_client_routine(esp_tls_t *tls, int fd, esp_tls_cfg_server_t *tls_cfg, SemaphoreHandle_t lock, TickType_t *ts)
{
    int ret = esp_tls_server_session_create(tls_cfg, fd, tls);
    if (ret != 0) return;

    do {
        if ((chunk = esp_tls_conn_read(tls, &request[len], sizeof(request) - len)) <= 0) {
            if (chunk == ESP_TLS_ERR_SSL_WANT_READ || chunk == ESP_TLS_ERR_SSL_WANT_WRITE) continue;
            return;
        }

        len += chunk;
    } while (len < sizeof(request) && (len < 2 || request[len - 2] != '\r' || request[len - 1] != '\n'));

    /* len < 2 guards against reading before the start of the buffer */
    if (len < 2 || request[len - 2] != '\r' || request[len - 1] != '\n') {
        return;
    }

    request[len - 2] = '\0';
    len -= 2;

    if (strncmp(request, "gemini://", sizeof("gemini://") - 1) != 0) return;
 
    ...
}

static void client_routine(void *pvParameters)
{
    gemini_socket_t *sock = pvParameters;
    esp_tls_t *tls = NULL;
    int fd;

    if (xSemaphoreTake(sock->lock, TIMEOUT_TICKS)) {
        fd = sock->fd;
        xSemaphoreGive(sock->lock);
    } else {
        /* shouldn't happen; a FreeRTOS task must delete itself instead of returning */
        vTaskDelete(NULL);
        return;
    }

    tls = esp_tls_init();
    if (tls) do_client_routine(tls, fd, sock->tls_cfg, sock->lock, &sock->ts);

    if (xSemaphoreTake(sock->lock, TIMEOUT_TICKS)) {
        sock->fd = -1;
        xSemaphoreGive(sock->lock);
    }

    close(fd);

    if (tls) {
        esp_tls_server_session_delete(tls);
    }

    vTaskDelete(NULL);
}

This thread performs the TLS handshake, receives the request (which ends with \r\n), then calls the code we've already seen, which extracts the response from stuff.bin and sends it.
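The request-reading loop is easy to get subtly wrong, so here is the same logic as a small Python sketch that can be tested without a board. read_request() is a hypothetical helper; recv stands for anything that returns the next chunk of bytes, and the 1024-byte limit (plus CRLF) is the maximum URL length from the Gemini specification, not necessarily the size of the server's buffer.

```python
SCHEME = b"gemini://"
MAX_REQUEST = 1024 + 2  # Gemini caps URLs at 1024 bytes, plus CRLF


def read_request(recv, limit=MAX_REQUEST):
    """Accumulate chunks from recv() until CRLF; return the URL or None."""
    request = b""
    while len(request) < limit:
        chunk = recv(limit - len(request))
        if not chunk:  # connection closed before a full request arrived
            return None
        request += chunk
        if request.endswith(b"\r\n"):
            break

    if not request.endswith(b"\r\n"):
        return None  # buffer filled up without a terminator

    url = request[:-2]
    if not url.startswith(SCHEME):
        return None
    return url.decode("utf-8", errors="replace")


def make_recv(parts):
    """Fake socket: hands out the given chunks, then signals end of stream."""
    it = iter(parts)
    return lambda n: next(it, b"")


print(read_request(make_recv([b"gemini://exa", b"mple.org/\r", b"\n"])))
# gemini://example.org/
```

The example feeds the request in three pieces, including a split in the middle of the \r\n terminator, which is the case a slow client can trigger over a real connection.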

It looks like esp_tls_conn_read() blocks even if O_NONBLOCK is enabled: without a separate thread of execution, other requests are blocked while we're busy doing the handshake. The separate thread and the memory used by its stack, etc., are not a disaster, because the number of requests we can handle is limited anyway, as we'll see later. The maximum memory consumption of the server is easy to calculate, and it's something the ESP32 can handle.

The Server

The server itself is not very interesting: it's just a listening TCP socket, plus loading of the server certificate and private key.

static esp_err_t init(gemini_server_t *s)
{
#if CONFIG_LWIP_IPV6
    int fd = socket(PF_INET6, SOCK_STREAM, 0);
#else
    int fd = socket(PF_INET, SOCK_STREAM, 0);
#endif
    if (fd < 0) return ESP_FAIL;
#if CONFIG_LWIP_IPV6
    struct in6_addr inaddr_any = IN6ADDR_ANY_INIT;
    struct sockaddr_in6 serv_addr = {
        .sin6_family  = PF_INET6,
        .sin6_addr    = inaddr_any,
        .sin6_port    = htons(1965)
    };
#else
    struct sockaddr_in serv_addr = {
        .sin_family   = PF_INET,
        .sin_addr     = {
            .s_addr = htonl(INADDR_ANY)
        },
        .sin_port     = htons(1965)
    };
#endif
    int enable = 1;
    setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &enable, sizeof(enable));

    int ret = bind(fd, (struct sockaddr *)&serv_addr, sizeof(serv_addr));
    if (ret < 0) {
        close(fd);
        return ESP_FAIL;
    }

    ret = listen(fd, 5);
    if (ret < 0) {
        close(fd);
        return ESP_FAIL;
    }

    s->listen_fd = fd;
    return ESP_OK;
}

void gemini_server_task(void *pvParameters)
{
    gemini_server_t *s;

    extern const unsigned char cert_pem_start[] asm("_binary_cert_pem_start");
    extern const unsigned char cert_pem_end[]   asm("_binary_cert_pem_end");

    extern const unsigned char key_pem_start[] asm("_binary_key_pem_start");
    extern const unsigned char key_pem_end[]   asm("_binary_key_pem_end");

    s = calloc(1, sizeof(*s));
    if (!s) {
        /* a FreeRTOS task must delete itself instead of returning */
        vTaskDelete(NULL);
        return;
    }

    for (int i = 0; i < sizeof(s->sockets) / sizeof(s->sockets[0]); ++i) {
        s->sockets[i].fd = -1;
        s->sockets[i].lock = xSemaphoreCreateMutex();
        s->sockets[i].tls_cfg = &s->tls_cfg;
    }

    s->tls_cfg.servercert_buf = cert_pem_start;
    s->tls_cfg.servercert_bytes = cert_pem_end - cert_pem_start;
    s->tls_cfg.serverkey_buf = key_pem_start;
    s->tls_cfg.serverkey_bytes = key_pem_end - key_pem_start;

    /* on failure, fall through so that s is freed and the task deletes itself */
    if (init(s) == ESP_OK) {
        while (tick(s) == ESP_OK);
        close(s->listen_fd);
    }
    free(s);

    vTaskDelete(NULL);
}

The tick() function calls select() with a timeout: if we have an incoming connection, it calls accept() and spawns a "task" that does the handshake, receives the request and sends a response.

tick() also looks for old connections that need to be closed. The accept() timestamp is saved aside, so tick() can close a slow connection. The ESP32 is limited to CONFIG_LWIP_MAX_SOCKETS (<= 16) sockets, so we can handle up to 15 (16 minus one listening socket) requests concurrently and must protect against DoS.

The most important thing I omitted from this snippet, in the name of clarity, is synchronization: the server has an array of CONFIG_LWIP_MAX_SOCKETS-1 "slots" for incoming requests, and each of these contains a mutex. This mutex guards access to the file descriptor and the timestamp, so the "server task" can find the oldest client and close its socket safely, when all slots are occupied, or when a request times out.
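Since the slot logic is omitted, here is a rough Python model of the idea, using threading.Lock in place of a FreeRTOS mutex. Everything in it is an assumption made for illustration: the Slot class, pick_slot(), expired() and the 10-second TIMEOUT are not taken from the server.

```python
import threading

TIMEOUT = 10.0  # seconds; a made-up value


class Slot:
    def __init__(self):
        self.lock = threading.Lock()  # guards fd and ts
        self.fd = -1                  # -1 means the slot is free
        self.ts = 0.0                 # accept() timestamp


def pick_slot(slots):
    """Return a free slot, or else the oldest busy one (to be evicted)."""
    oldest, oldest_ts = None, None
    for slot in slots:
        with slot.lock:
            if slot.fd == -1:
                return slot
            if oldest is None or slot.ts < oldest_ts:
                oldest, oldest_ts = slot, slot.ts
    return oldest


def expired(slots, now, timeout=TIMEOUT):
    """Return the busy slots whose request has been running for too long."""
    stale = []
    for slot in slots:
        with slot.lock:
            if slot.fd != -1 and now - slot.ts > timeout:
                stale.append(slot)
    return stale


slots = [Slot() for _ in range(3)]
for i, slot in enumerate(slots):
    slot.fd, slot.ts = 10 + i, 100.0 + i  # all busy, accepted at 100, 101, 102

print(pick_slot(slots) is slots[0])             # True: slot 0 is the oldest
print([s.fd for s in expired(slots, 111.5)])    # [10, 11]
```

Each lock is held only while one slot's fields are read, so a client task that is updating its own slot delays the scan only briefly.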

Entry Point

When the ESP32 starts, it connects to my Wi-Fi network:

static void wifi_conn_init(void)
{
    ESP_ERROR_CHECK(esp_netif_init());

    ESP_ERROR_CHECK(esp_event_loop_create_default());
    esp_netif_create_default_wifi_sta();

    wifi_init_config_t cfg = WIFI_INIT_CONFIG_DEFAULT();
    ESP_ERROR_CHECK(esp_wifi_init(&cfg));

    ESP_ERROR_CHECK(esp_event_handler_instance_register(WIFI_EVENT,
                                                        ESP_EVENT_ANY_ID,
                                                        &wifi_event_handler,
                                                        NULL,
                                                        NULL));

    ESP_ERROR_CHECK(esp_event_handler_instance_register(IP_EVENT,
                                                        ESP_EVENT_ANY_ID,
                                                        &ip_event_handler,
                                                        NULL,
                                                        NULL));

    wifi_config_t wifi_config = {
        .sta = {
            .ssid = WIFI_SSID,
            .password = WIFI_PASS,
        },
    };
    ESP_ERROR_CHECK(esp_wifi_set_mode(WIFI_MODE_STA) );
    ESP_ERROR_CHECK(esp_wifi_set_config(WIFI_IF_STA, &wifi_config) );
    ESP_ERROR_CHECK(esp_wifi_start() );
}

void app_main(void)
{
    ESP_ERROR_CHECK( nvs_flash_init() );
    wifi_conn_init();
}

When the ESP32 obtains an IP address, it starts a Gemini server "task":

static void ip_event_handler(void *arg, esp_event_base_t event_base,
                               int32_t event_id, void *event_data)
{
    if (event_id == IP_EVENT_STA_GOT_IP)
        xTaskCreate(&gemini_server_task, "gemid", 10240, NULL, tskIDLE_PRIORITY+5, NULL);
}

If the ESP32 gets disconnected from the Wi-Fi network, which tends to happen at night (that's why my capsule was down every once in a while, before I fixed this), it reconnects:

static void wifi_event_handler(void *arg, esp_event_base_t event_base,
                               int32_t event_id, void *event_data)
{
    if (event_id == WIFI_EVENT_STA_DISCONNECTED)
        esp_wifi_connect();
    else if (event_id == WIFI_EVENT_STA_START)
        esp_wifi_connect(); 
}

Dynamic DNS

In addition to the Gemini server part, the application obtains my router's external IP address via ifconfig.me and updates dimkr-esp32.duckdns.org via the Duck DNS API. This code is not very interesting, and almost identical to the HTTPS client example.

https://www.duckdns.org

Right now, this is done when the ESP32 obtains an IP address, and it works fine because my IP address doesn't change much, although I haven't paid extra to get a static external address. In the future, maybe I'll need to do this update periodically: say, at least once a day, just in case my router lost its previous external address.
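A periodic update could be as simple as remembering when the last update happened. The sketch below only builds the request URL and decides whether an update is due; it assumes the Duck DNS update endpoint with its domains, token and ip parameters, and the token value, the once-a-day interval and both function names are placeholders. As far as I know, Duck DNS falls back to the address the request came from when ip is left empty.

```python
from urllib.parse import urlencode

UPDATE_INTERVAL = 24 * 60 * 60  # once a day, in seconds (an assumption)


def duckdns_update_url(domain, token, ip=""):
    """Build the Duck DNS update URL for a subdomain."""
    query = urlencode({"domains": domain, "token": token, "ip": ip})
    return "https://www.duckdns.org/update?" + query


def update_due(last_update, now, interval=UPDATE_INTERVAL):
    """True if no update happened yet, or the last one is too old."""
    return last_update is None or now - last_update >= interval


# "TOKEN" is a placeholder, 203.0.113.7 is a documentation-only address.
print(duckdns_update_url("dimkr-esp32", "TOKEN", "203.0.113.7"))
print(update_due(None, 0), update_due(0, 3600), update_due(0, 90000))
# True False True
```

On the board, the same check could run from the server task's tick() or from a timer, so a changed external address is picked up within a day even if the Wi-Fi connection never drops.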

The router is configured to forward port 1965 to the board and gemini.dimakrasner.com is a CNAME record (an alias) for dimkr-esp32.duckdns.org.