💾 Archived View for thrig.me › blog › 2024 › 12 › 12 › bdb.gmi captured on 2024-12-17 at 10:40:55. Gemini links have been rewritten to link to archived content


Berkeley DB

Berkeley DB (BDB) is old, though it has the advantage of being fairly simple to use and is generally available by default (or easily installed via some ports or package system). The disadvantages are many: Subversion tried to use BDB before moving to the filesystem-based database they use now, and BDB lacks features one can find in SQLite, which is also a memory- or filesystem-based database. SQLite on the other hand is slightly more complicated, though is likely a better option, especially if you have configuration management that handles bootstrapping the various database files with the appropriate schema. Humans are pretty good at skipping steps or forgetting to do things, so one may want a system that "just works" with as little meddling as possible.

How simple is BDB?

    $ perl -MDB_File -e 'tie %f,DB_File => "f";$f{bar} = 42'
    $ file f
    f: Berkeley DB 1.85 (Hash, version 2, native byte-order)
    $ perl -MDB_File -E 'tie %f,"DB_File","f";say "$_ $f{$_}" for keys %f'
    bar 42
    $ file /etc/mail/aliases.db 
    /etc/mail/aliases.db: Berkeley DB 1.85 (Hash, version 2, native byte-order)
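
In a script, as opposed to a one-liner, one would probably want to check whether the tie worked. A minimal sketch, assuming the same file name "f" as above; the open flags, mode, and $DB_HASH arguments follow the DB_File documentation, and the error messages are made up for illustration.

```perl
use 5.36.0;
use DB_File;
use Fcntl;    # O_RDWR, O_CREAT, O_RDONLY

my $file = "f";    # same file name as the one-liners above

# create the file if need be, and fail loudly if that does not work
tie my %f, 'DB_File', $file, O_RDWR | O_CREAT, 0666, $DB_HASH
  or die "could not tie '$file': $!\n";

$f{bar} = 42;
untie %f;          # flush and close the database

# reopen read-only and list what was stored
tie %f, 'DB_File', $file, O_RDONLY, 0666, $DB_HASH
  or die "could not tie '$file': $!\n";
say "$_ $f{$_}" for keys %f;
untie %f;
```

The explicit untie is there so that the data is written out before anything else tries to read the file.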

Multiple Values

A simple key-value store. What more could you possibly need? DB_File cannot store references, or rather Perl will store the string form of the reference (the type and memory address), which is generally not very useful.

    $ perl -MDB_File -e 'tie %f,DB_File => "f";$f{bar} = [ "Goblin", 8 ]' 
    $ perl -MDB_File -E 'tie %f,"DB_File","f";say "$_ $f{$_}" for keys %f'
    bar ARRAY(0x49044f357d8)

One workaround (as opposed to using a fancier database system) is to encode the values into a string (comma separated values) or packed scalar and then to parse or decode the encoded value when reading records back. A comma might be bad if the data may contain commas, as then you either need to escape (and unescape) only the right commas, or maybe instead use the NUL character as the delimiter, which suffers from the same problem as the comma if the data can have '\0' in it.

String encode and parse

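    use 5.36.0;
    # join the values with NUL; assumes that no value contains "\0"
    sub encode(@values) { join "\0", @values }
    # the -1 limit keeps trailing empty fields that split would otherwise drop
    sub decode($value) { split /\0/, $value, -1 }

    my $e = encode(qw{foo bar 42});
    printf "%vx\n", $e;    # each character in hex, joined with "."
    say for decode($e);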

Better code might check that none of the values to encode have NUL in them, and either throw an error or encode it somehow; encoding it somehow will of course complicate the decode code. Much here depends on what exactly will be stored and whether what is being stored comes from untrusted sources: the network, users being users, hostile attackers, buggy code, etc. A simple join or split on a comma may well work if the data is known and well controlled, and neither paying customers nor regulations are anywhere near the process. Goblin attack tables for the casual game night? Okay. Patient data records from remote providers? Nope, use something better.
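
What the "throw an error" option could look like, as a sketch: the croak messages and the refusal of undefined values are assumptions made for illustration, not something any particular application requires.

```perl
use 5.36.0;
use Carp qw(croak);

# refuse values that contain the delimiter rather than corrupt the record
sub encode(@values) {
    for my $v (@values) {
        croak "undefined value" unless defined $v;
        croak "NUL in value" if index( $v, "\0" ) >= 0;
    }
    return join "\0", @values;
}

# a limit of -1 keeps trailing empty fields
sub decode($value) { split /\0/, $value, -1 }

my $e = encode( "Goblin", 8, "" );
say scalar( my @fields = decode($e) );    # 3
say "refused: $@" unless eval { encode("bad\0value"); 1 };
```

Throwing an error keeps the decode side trivial; escaping the NUL instead would mean that decode has to undo the escape, and that both sides agree on how.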

Pack

Packed values are something like structures in C: "numbers" get stuffed into bit buckets of some length, sign, and byte order, strings can be stored as netstrings (length + string data), and so forth. The byte order of the packed values may be moot given that the above BDB files themselves use "native byte-order" according to file(1), so if you need raw file portability between little- and big-endian systems, you may need some other database. Another concern is "native" integer sizes, as one might encounter when moving Mailman from a 32-bit to a 64-bit system, which involved, for me, writing some Perl scripts to export all the mailing lists and then re-create and re-populate the lists on the new system. That's a long way to say you should specify integer sizes exactly rather than relying on whatever the system happens to provide: use an exact size and sign, e.g. "int8_t" (signed integer, 8 bits) rather than whatever "char" happens to be.

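    use 5.36.0;
    # C = unsigned octet (string length), a* = the string, c = signed octet
    sub encode( $race, $hp ) { pack 'Ca*c', length($race), $race, $hp }
    sub decode($value) {
        my $len = unpack C => $value;    # the leading length octet
        warn "L $len\n";                 # debug: show the length
        # skip the length octet, then read $len bytes and a signed octet
        my ( $race, $hp ) = unpack "x[C]a${len}c", $value;
        return $race, $hp;
    }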

    my $e = encode( "Goblin", 8 );
    printf "%vx\n", $e;
    my ( $race, $hp ) = decode($e);
    say for $race, $hp;

Of course you may want bounds checking so that when a "number" comes along that is too big, or a string that is too large, you get an error rather than... well, whatever it is that happens when something too big gets stuffed into a bit bucket that is too small. This risk goes up, a lot, when untrusted input is involved. With the above code one might check that "length($race)" is no larger than 255, the most that the "C" pack template, an unsigned char (octet) value, can hold.
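
A sketch of such bounds checks, assuming the same record layout as above. It uses the "C/a*" template, which has pack write the length octet itself, in place of the manual length handling; the limits are those of an unsigned ("C", 0 to 255) and a signed ("c", -128 to 127) octet, and the error messages are made up.

```perl
use 5.36.0;
use Carp qw(croak);

# "C/a*" is a string prefixed by its length in one unsigned octet,
# "c" is a signed 8-bit integer
my $template = 'C/a* c';

sub encode( $race, $hp ) {
    croak "race too long" if length($race) > 255;
    croak "hp not an integer" unless $hp =~ /\A-?[0-9]+\z/;
    croak "hp out of range" if $hp < -128 or $hp > 127;
    return pack $template, $race, $hp;
}

sub decode($value) {
    my ( $race, $hp ) = unpack $template, $value;
    croak "truncated record" unless defined $hp;
    return $race, $hp;
}

my $e = encode( "Goblin", 8 );
say for decode($e);
say "refused: $@" unless eval { encode( "Goblin", 9000 ); 1 };
```

Whether to throw an error or to clamp the value to the nearest limit depends on the application; an error is at least hard to miss.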

Or maybe you instead want a serialization library to handle such details for you?

Serialization

There are serialization modules on CPAN, or you can use Data::Dumper and then pray that the "eval" used to decode the data does not come along with a giant heap of arbitrary code execution. I'm not a fan of serialization systems that risk security vulnerabilities, though some programmers are.

Recently, on the "Ask The Architect" session from the Devoxx UK 2018 conference, Oracle's chief architect, Mark Reinhold, shared his thoughts about Java’s serialization mechanism which he called a “horrible mistake” and a virtually endless source of security vulnerabilities.

https://www.securityinfowatch.com/cybersecurity/information-security/article/12420169/oracle-plans-to-end-java-serialization-but-thats-not-the-end-of-the-story
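
For serialization that carries only data, JSON is one option; JSON::PP has been bundled with Perl since 5.14. A sketch, assuming the record is a plain array of strings and numbers and that "monsters.db" is a hypothetical file name:

```perl
use 5.36.0;
use DB_File;
use Fcntl;
use JSON::PP;

# canonical sorts hash keys so the same data always encodes the same way
my $json = JSON::PP->new->utf8->canonical;

my $file = "monsters.db";    # hypothetical file name
tie my %f, 'DB_File', $file, O_RDWR | O_CREAT, 0666, $DB_HASH
  or die "could not tie '$file': $!\n";

# encode on the way in ...
$f{bar} = $json->encode( [ "Goblin", 8 ] );

# ... and decode on the way out; decode parses data, it does not eval code
my ( $race, $hp ) = $json->decode( $f{bar} )->@*;
say "$race $hp";

untie %f;
```

A value that is not valid JSON makes decode throw an exception, so files from untrusted sources probably call for a block eval (not a string eval) around the decode.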

Locking

Roll your own? flock(2) comes to mind. SQLite is likely a better option if there are lots of readers and writers, that is, for when the database is more than merely trivial. I'd probably switch to SQLite if multiple scripts need access to the same database, and more so if multiple people manage the system. Otherwise, you'd need to invent an API and somehow get everything to follow it, and then it only takes one thing doing the wrong thing once to make everything fall apart in some hard-to-debug way that will wake the sysadmin (typically, me) up at some horrible hour of the night.

/blog/2023/07/24/only-one-script.gmi
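
A sketch of the flock approach, assuming every script agrees to take an exclusive lock on a separate lock file before touching the database; the "f.lock" name is hypothetical. flock is advisory, so this only protects against scripts that follow the same convention.

```perl
use 5.36.0;
use DB_File;
use Fcntl qw(:DEFAULT :flock);

my $file = "f";
my $lock = "$file.lock";    # hypothetical lock file name

# every script that touches the database must take this lock first
open my $lockfh, '>>', $lock or die "could not open '$lock': $!\n";
flock( $lockfh, LOCK_EX ) or die "could not lock '$lock': $!\n";

tie my %f, 'DB_File', $file, O_RDWR | O_CREAT, 0666, $DB_HASH
  or die "could not tie '$file': $!\n";
$f{counter}++;
untie %f;    # flush to disk before the lock is released

close $lockfh or die "could not close '$lock': $!\n";    # drops the lock
```

The tie happens only after the lock is held and the untie before it is released, so that no other cooperating script sees a half-written database.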