💾 Archived View for thrig.me › blog › 2024 › 05 › 18 › site-feeds.gmi captured on 2024-06-16 at 12:22:21. Gemini links have been rewritten to link to archived content

-=-=-=-=-=-=-

Site Feeds

doing RSS by hand

Doing feeds by hand is probably tedious and error prone, but automated Atom generation may involve libraries with dependencies (XML, date & time modules, etc) that may break on update (due to the many moving parts of all the libraries involved) or otherwise bring security problems (supply chain attacks against the dependencies, bugs in the libraries, etc). The good news is that you are probably not processing untrusted, maybe malicious content, so the security issues are probably minor. Alternatively, you could throw all the XML stuff out on account of ETOOCOMPLICATED.

tossing XML out

Minimum XML

One problem with XML is that various characters must be escaped, "<" and "&" for instance. This makes generating valid XML slightly more difficult than just printing strings, and is why many folks will recommend using a library that handles all the fiddly details for you. Another problem for Atom is getting the time formatted just right (RSS uses a different time format), for which the usual recommendation is again to use a library. And there are various encodings? Some libraries for XML and timestamp handling try to support everything, get pretty huge, and thus may be more trouble than they are worth, to which "throw it all out" is a response.

Or you could try not to use fancy characters in titles, until you forget when writing the great American "notepad > *" blog post. Such a method works great, until it does not. But wait, there's more! If someone untrusted is supplying the blog posts they could put in "...</title></entry><entry><title>..." and insert new entries and links and whatnot using text that was supposed to only be for the title of a posting. This is an injection attack, a close relative of XSS or cross site scripting. Probably best to avoid it.
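To make the injection concrete, here is a small sketch in shell. The hostile title and the entry markup are made up for illustration; the title is printed into the markup without any escaping, and counting the "<entry>" openings shows that one posting has become two.

```shell
# a made-up hostile title from someone untrusted
title='x</title></entry><entry><title>injected'
# naive generation: the string goes straight into the markup
naive=$(printf '<entry><title>%s</title></entry>' "$title")
printf '%s\n' "$naive"
# count the "<entry>" openings; grep -o prints each match on its own
# line (-o is not in POSIX, but GNU and BSD grep both have it)
count=$(printf '%s\n' "$naive" | grep -o '<entry>' | wc -l)
echo "entries: $((count))"
```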

So we need to encode those wacky "<" and other characters,

    $ perl -MHTML::Entities -E 'say encode_entities(q{<"'\''>})'
    &lt;&quot;&#39;&gt;
    $ perl -MHTML::Entities=encode_entities_numeric \
           -E 'say encode_entities_numeric(q{<"'\''>})'
    &#x3C;&#x22;&#x27;&#x3E;

which means things already have gotten pretty yucky and you may need to peer at ascii(7) to confirm that the character encoding and/or shell quoting has been done correctly. The "'\''" trick turns off the prior single quote block, inserts a literal "'", and then starts a new single quote block. There are other ways to shell quote values, but I find this method pretty mechanical: single quote the whole thing, then replace any "'" within the single quoted thing with "'\''". Or sometimes you can change the "'" to be represented by something else. Or you could write a standalone script to avoid the problems of shell quoting.

    $ perl -MHTML::Entities -E 'say encode_entities(qq{<"\047>})'
    &lt;&quot;&#39;&gt;
    $ cat quoteit
    #!/usr/bin/perl
    use 5.36.0;
    use HTML::Entities;
    say encode_entities q{<"'>}
    $ perl quoteit
    &lt;&quot;&#39;&gt;
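The mechanical quoting rule can itself be scripted. This is only a sketch: "shquote" is a hypothetical helper name, and trailing newlines in the argument are lost to the command substitution.

```shell
# shquote: a hypothetical helper that wraps its argument in single
# quotes and rewrites each embedded ' as '\''
shquote() {
    printf "'%s'\n" "$(printf '%s' "$1" | sed "s/'/'\\\\''/g")"
}

shquote "<\"'>"
# the quoted form can be handed back to the shell, which yields the
# original string again
eval "set -- $(shquote "<\"'>")"
printf '%s\n' "$1"
```

The first line of output should be something that can be pasted as-is into a one-liner; the second shows the round trip.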

The HTML::Entities module by the way is part of the HTML-Parser distribution so if you only wanted entity quoting, you would here also be pulling in a full HTML parser with all sorts of interesting files such as "hparser.c" (the things people did to emulate what buggy browsers allow for). This is one way software dependencies can snowball on you. On the other hand you probably do want to check your code against some other implementation to help ensure that you did not make a mistake.

    $ echo 'notepad > *' | sh encode
    notepad >gt; *

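Whoops. An unescaped "&" in a sed replacement stands for the matched text, so it has to be written as "\&".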

    $ echo 'notepad > *' | sh encode
    notepad &gt; *
    $ cat encode
    sed 's/&/\&amp;/g;s/</\&lt;/g;s/>/\&gt;/g;s/"/\&quot;/g;s/'\''/\&#39;/g'
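Lacking a second implementation to compare against, a few known input and output pairs can stand in as a check. The following is a sketch: "xml_escape" and "check" are hypothetical names, and the filter here escapes "&" first, as XML requires, so that the entities made by the later substitutions are not escaped a second time.

```shell
# xml_escape: stdin filter; "&" goes first so that the entities made by
# the later substitutions are not escaped again
xml_escape() {
    sed 's/&/\&amp;/g;s/</\&lt;/g;s/>/\&gt;/g;s/"/\&quot;/g;s/'\''/\&#39;/g'
}

# check INPUT WANTED: complain and fail if the filter output differs
check() {
    got=$(printf '%s\n' "$1" | xml_escape)
    if [ "$got" != "$2" ]; then
        printf 'FAIL: [%s] gave [%s], wanted [%s]\n' "$1" "$got" "$2" >&2
        return 1
    fi
}

status=0
check 'notepad > *' 'notepad &gt; *'      || status=1
check 'Q&A'         'Q&amp;A'             || status=1
check '<"'\''>'     '&lt;&quot;&#39;&gt;' || status=1
check '&lt;'        '&amp;lt;'            || status=1
if [ "$status" -eq 0 ]; then echo 'all ok'; else echo 'failures' >&2; fi
```

The last case is the double-escaping one: text that already looks like an entity gets escaped again, which is correct if the input is plain text and wrong if it was already XML.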

Microsoft Omake Theater Interlude

    <a3de751> Not only have I had A Day, fucking Microsoft pushed that
              "ads in the start menu" update on me and it broke, well,
              basically everything.
    <a3de751> Everything including, most amusingly, the start menu.
    <thrig> herp derp
    <a3de751> It's OK, I uninstalled "KB0571095696766677651 (Misc.
              Updates)" and now it's back.

Timestamps

Atom uses RFC 3339 timestamps, a profile of ISO 8601, for example "2024-05-17T13:25:53Z". Gemfeeds lack the hours, minutes, and seconds portions so might default to midnight? Maybe the date(1) command can be used to convert from one format to another. This depends on what your input format is: unix epoch? YYYY-MM-DD? other? For timezone sanity I keep everything in UTC, though do have a "TZ=US/Pacific date" shell alias for the local time. If you want not-Zulu times, things will be more involved than sticking a "Z" at the end of the string and hoping that the timezone is UTC or that the date is close enough. Note too that RFC 3339 wants a colon in the offset ("-07:00"), which the "%z" of strftime does not supply.

    $ date2epoch 2024-05-17
    1715904000
    $ TZ=UTC date -r 1715904000 +%Y-%m-%dT%H:%M:%SZ
    2024-05-17T00:00:00Z
    $ TZ=US/Pacific date -r 1715904000 +%Y-%m-%dT%H:%M:%S%z
    2024-05-16T17:00:00-0700
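If local offsets are wanted in a feed, the "-0700" that "%z" prints needs a colon to be a valid RFC 3339 offset. A minimal sketch of a filter for that, where "rfc3339_offset" is a made-up name:

```shell
# rfc3339_offset: stdin filter that turns a trailing +hhmm or -hhmm
# into +hh:mm or -hh:mm; lines that end in "Z" pass through unchanged
rfc3339_offset() {
    sed 's/\([+-][0-9][0-9]\)\([0-9][0-9]\)$/\1:\2/'
}

printf '%s\n' 2024-05-16T17:00:00-0700 | rfc3339_offset
```

This should print "2024-05-16T17:00:00-07:00".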

The date(1) command may not be portable. There may be other options. As usual, the commands shown here assume OpenBSD—worse, OpenBSD plus whatever wacky scripts I've written.

    $ awk 'BEGIN{print strftime("%Y-%m-%dT%H:%M:%SZ",1715904000)}'
    2024-05-17T00:00:00Z
    $ corelist Time::Piece

    Data for 2024-04-27
    Time::Piece was first released with perl v5.9.5
    $ perl -MTime::Piece -E '$t=gmtime 1715904000;say $t->datetime,"Z"'
    2024-05-17T00:00:00Z
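Since the date(1) flags differ between systems, one option is to try the BSD form and fall back to the GNU one. This is a sketch; "epoch2atom" is a hypothetical name, and on GNU systems it assumes there is no file named after the epoch value in the current directory, as there "-r" reads the modification time of a file.

```shell
# epoch2atom SECONDS: hypothetical helper that prints a UTC timestamp
epoch2atom() {
    # BSD date(1) takes the epoch via -r; GNU date treats -r as a file
    # name, fails, and the -d @SECONDS form is tried instead
    date -u -r "$1" +%Y-%m-%dT%H:%M:%SZ 2>/dev/null ||
        date -u -d "@$1" +%Y-%m-%dT%H:%M:%SZ
}

epoch2atom 1715904000
```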

Shell Script

If you already have "2024-05-17" from a gemfeed, then you only need to stick "T12:00:00Z" onto the end of it. So with various assumptions, a shell script may suffice to convert from gemfeed to Atom. This script doubtless needs more work (better error checking, for example) but probably could suffice, and it relies only on basic unix tools. CPU use will be high due to the forks, sh is not an efficient language, and the "while" form assumes that the last line ends with a newline, among other problems.

    #!/bin/sh
    # a pretty bad gemfeed to Atom converter
    title="TODO FIXME"
    base=gemini://example.org/blog/
    # NOTE $title and $base are printed as-is, so must not need escaping
    # date -u, as a "Z" on the end of a local time would be a lie
    printf '<?xml version="1.0" encoding="UTF-8"?><feed xmlns="http://www.w3.org/2005/Atom"><title>%s</title><updated>%s</updated><link href="%s"></link>\n' "$title" "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$base"
    # NOTE this assumes the "=> ..." link form; "=>..." is also legal
    link='=>'
    # NOTE no checking is done that the date is actually a date
    # -r so that read does not eat backslashes in titles
    while read -r prefix path date rest; do
        if [ -n "$rest" ] && [ "$prefix" = "$link" ]; then
            printf '<entry><title>'
            # "&" is escaped first, else it would mangle the entities
            # made by the other substitutions
            printf '%s' "$rest" |
              sed 's/&/\&amp;/g;s/</\&lt;/g;s/>/\&gt;/g;s/"/\&quot;/g;s/'\''/\&#39;/g'
            # NOTE this assumes the path does not contain an injection
            # of arbitrary XML!
            printf '</title><updated>%sT12:00:00Z</updated><link href="%s" rel="alternate"></link></entry>\n' "$date" "$base$path"
        fi
    done
    printf '</feed>\n'

It's a standard input filter, so use might look like:

    $ cat index.gmi
    foo
    => 2024/03/01/nul-in-filename.gmi 2024-03-01 Non-Terminal '\0' in Filenames
    => 2023/09/27/pidnull.gmi 2023-09-27 <>'" …
    bar
    $ sh g2atom < index.gmi
    <?xml version="1.0" encoding="UTF-8"?><feed xmlns="http://www.w3.org/2005/Atom"><title>TODO FIXME</title><updated>2024-05-18T16:15:59Z</updated><link href="gemini://example.org/blog/"></link>
    <entry><title>Non-Terminal &#39;\0&#39; in Filenames</title><updated>2024-03-01T12:00:00Z</updated><link href="gemini://example.org/blog/2024/03/01/nul-in-filename.gmi" rel="alternate"></link></entry>
    <entry><title>&lt;&gt;&#39;&quot; …</title><updated>2023-09-27T12:00:00Z</updated><link href="gemini://example.org/blog/2023/09/27/pidnull.gmi" rel="alternate"></link></entry>
    </feed>

Another handy feature might be to stop after processing N entries, assuming that the most recent entries are posted at the top of the file.
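One way to get that without touching the converter is a filter in front of it. A sketch, assuming newest entries first; "first_entries" is a made-up name.

```shell
# first_entries N: hypothetical stdin filter that stops before link
# line N+1; other lines up to that point pass through
first_entries() {
    awk -v max="$1" '/^=>/ && ++n > max { exit } { print }'
}

printf '%s\n' 'foo' '=> a.gmi 2024-03-01 A' '=> b.gmi 2023-09-27 B' \
    'bar' '=> c.gmi 2023-01-01 C' | first_entries 2
```

Use might then look like "first_entries 20 < index.gmi | sh g2atom". Keeping the limit in a separate filter means it can be dropped from the pipeline without editing the converter.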

Perl Script

This is less bad than the shell script, and should run on OpenBSD which includes perl (and Time::Piece) in the base system. There is more error checking, and output is only printed if there are no errors. An even fancier version could write to a File::Temp file, and rename that file if all goes well for an atomic update on an atom.xml file. Error checking includes parsing the gemfeed time and checking that the path does not contain certain characters to avoid injection issues, and that the path does not contain "..", which is often used in path traversal exploits. Even better would be to use a URI module and to emit a canonical URL for the links (or to fail if there are problems doing that), but that would bring in Net::Gemini for gemini URL support, and we're aiming for minimum external software here.

    #!/usr/bin/env perl
    use 5.10.0;
    use strict;
    use warnings;
    use Time::Piece;
    my $title = 'TODO FIXME';
    my $base  = 'gemini://example.org/blog/';
    # UTC with a literal "Z", as the +hhmm of %z is not valid in Atom
    my $now   = gmtime->strftime('%Y-%m-%dT%H:%M:%SZ');
    my $s     = <<"EOH";
    <?xml version="1.0" encoding="UTF-8"?>
    <feed xmlns="http://www.w3.org/2005/Atom">
    <title>$title</title>
    <updated>$now</updated>
    <link href="$base"></link>
    EOH
    my $ecount = 0;

    while (readline) {
        chomp;
        # \s+ between the fields so that one word cannot be split in three
        if (m/^=>\s*(\S+)\s+(\S+)\s+(.+)/) {
            my ( $path, $date, $rest ) = ( $1, $2, $3 );
            # numeric entities for the XML special characters
            $rest =~ s{([&<>"'])}{'&#'.ord($1).';'}eg;
            # NOTE this accepts 2025-02-29, see the sidequest below
            eval { Time::Piece->strptime( $date, '%Y-%m-%d' ); 1 }
              or die "error: invalid date '$date' in '$_' at $ARGV:$.\n";
            die "error: problematic path '$path' in '$_' at $ARGV:$.\n"
              if $path =~ m/[&<>"]|\.\./;
            $s .= <<"EOE";
    <entry>
      <title>$rest</title>
      <updated>${date}T12:00:00Z</updated>
      <link href="$base$path" />
    </entry>
    EOE
            $ecount++;
        }
    } continue {
        close ARGV if eof;
    }
    die "error: no entries found\n" unless $ecount;
    print $s, "</feed>\n";

Also good would be to check that the output of your script can be consumed by a feed reader (or to validate the XML with some schema thing), especially if you are using minimal bespoke generation code that may not properly handle, I don't know, encodings besides whatever the default is on unix. A strict validator will also want the "id" elements that RFC 4287 requires of feeds and entries, which the above scripts do not emit. The above scripts are mostly "garbage in, garbage out": the input may not be UTF-8, nor may UTF-8 be handled correctly. But that's a cost of solutions that minimize pulling in dependencies.
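A cheap check along those lines might look like the following sketch. It assumes that iconv(1) and xmllint(1) are installed (xmllint comes with libxml2 and may need to be added), "check_feed" is a made-up name, and it only tests that the file is UTF-8 and well-formed XML, not that it is valid Atom.

```shell
# check_feed FILE: hypothetical helper; fails if FILE is not UTF-8 or
# is not well-formed XML
check_feed() {
    # a UTF-8 to UTF-8 conversion fails on bytes that are not UTF-8
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null || return 1
    # --noout: report problems, but do not print the document
    xmllint --noout "$1"
}
```

Use might look like "sh g2atom < index.gmi > atom.tmp && check_feed atom.tmp".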

A gemfeed entry, by contrast, is not much more than

    $ printf '=> %s %s %s\n' blah/ `TZ=UTC date +%Y-%m-%d` 'notepad > *'
    => blah/ 2024-05-17 notepad > *

and you don't need to worry about escaping "&" or about double-escaping something that is already escaped… if those are a worry, you probably want a library, or to drop support for XML.

Date Time Sidequest

It is better to test than to guess what exactly a piece of software does with invalid and almost valid dates such as 2025-02-29. Various options are possible, including to fail hilariously, to invent a new date, or to throw an error that you may have to manually check for nil or NULL or -1 or something.

    $ perl -MTime::Piece \
      -E 'say Time::Piece->strptime(qw[NOPE-02-29 %Y-%m-%d])'
    Error parsing time at ...
    $ perl -MTime::Piece \
      -E 'say Time::Piece->strptime(qw[2025-02-29 %Y-%m-%d])'
    Sat Mar  1 00:00:00 2025

This module follows the mktime(3) interface of assuming that a month day of 0 is the last day of the previous month, or that going past the end of a month by days takes you somewhere into the future. Other interfaces will throw an error. If you need an error thrown for a bad date (and humans are really good at typos), you'll need to use some other library, or to format the parsed date back out and compare that with the input. In theory this should not be a problem for my workflow, as blog posting dates are generated from the current time automatically (assuming that the system clock is correct, which it may not be). If you are typing in the date by hand, maybe you want to have something double-check that?
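For the double-check, a date of the YYYY-MM-DD form can be validated with nothing more than shell arithmetic. This is a sketch under the assumption that the Gregorian leap year rule is what is wanted; "valid_date" and the "vd_" variables are made-up names.

```shell
# valid_date YYYY-MM-DD: succeeds only for a real calendar date; note
# that the vd_ variables are global, as sh has no local variables
valid_date() {
    case $1 in
    [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]) ;;
    *) return 1 ;;
    esac
    vd_rest=${1#*-}
    # the "1" prefix and subtraction avoid 08 and 09 being read as octal
    vd_y=$((1${1%%-*} - 10000))
    vd_m=$((1${vd_rest%-*} - 100))
    vd_d=$((1${vd_rest#*-} - 100))
    case $vd_m in
    4|6|9|11) vd_max=30 ;;
    2)  vd_max=28
        if [ $((vd_y % 4)) -eq 0 ] && [ $((vd_y % 100)) -ne 0 ] ||
           [ $((vd_y % 400)) -eq 0 ]; then
            vd_max=29
        fi ;;
    *)  vd_max=31 ;;
    esac
    [ "$vd_m" -ge 1 ] && [ "$vd_m" -le 12 ] &&
        [ "$vd_d" -ge 1 ] && [ "$vd_d" -le "$vd_max" ]
}

for d in 2024-02-29 2025-02-29 NOPE-02-29 2024-04-31; do
    if valid_date "$d"; then echo "$d ok"; else echo "$d bad"; fi
done
```

Only the first of the four dates should be reported as ok.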

Another important point about strptime of year, month, day is that the hour, minute, and second fields may need to be manually filled in to suitable values, and also the timezone and whether Daylight Saving Time is in effect, or not. Those last two points raise "ugh, local timezones!" and make me prone to keep as much as possible in UTC. This may not be possible if you have customers who do want to use those wobbly local timezones.

Workflow

You probably want scripts that simply can be re-run when something fails, rather than having to manually clean up after who knows what happened. Atomicity and idempotency are fancy words here, or Guarded Commands. Also consider how "tightly coupled" the script or scripts are, as in if the XML stuff fails, how easy is it to temporarily disable that? Detecting and not creating duplicate entries might also be good.
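As a sketch of the atomicity part: write the new feed to a temporary file in the same directory, and rename it into place only if the generator exited without error. "update_feed" and the "uf_" variables are hypothetical names, and mktemp(1), while common, is not in POSIX.

```shell
# update_feed DEST COMMAND...: hypothetical helper that runs COMMAND
# with its output in a temporary file next to DEST, and only renames
# that file over DEST if COMMAND succeeded
update_feed() {
    uf_dest=$1
    shift
    uf_tmp=$(mktemp "$uf_dest.XXXXXXXXXX") || return 1
    if "$@" > "$uf_tmp"; then
        # mktemp files are private; a feed probably should be readable
        chmod 644 "$uf_tmp" && mv -f "$uf_tmp" "$uf_dest"
    else
        rm -f "$uf_tmp"
        return 1
    fi
}
```

Use might look like "update_feed atom.xml sh g2atom < index.gmi". A failed run leaves the old atom.xml alone, so the script can simply be run again once the problem is fixed.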

Getting to a good workflow may require some combination of experience and fast iteration on prototypes, and maybe some luck (or misfortune) to have to spend time fixing some horrible problem in production.