💾 Archived View for thrig.me › blog › 2024 › 10 › 01 › flatfile-database.gmi captured on 2024-12-17 at 10:42:52. Gemini links have been rewritten to link to archived content
-=-=-=-=-=-=-
gemini://hanzbrix.pollux.casa/gemlog/20241001.gmi
Flatfile databases have been around since the days of the terminals and tapedecks, we of course quickly went away from it, once we had relational databases, with indexes, caching in ram and other fancy hacks to deliver data at lightning speed. In my opinion, we mainly moved to relational databases, to escape the limitations of spinning rust (mechanical hard drives), but as time went on, we got solid state storage and later solid state storage connected directly to the pc backbone, hard drive speeds isn't exactly a limitation anymore.
Flatfiles are slow, even when cached in memory (which for anything frequent will pretty much be the case) as a linear scan for the 27,879th user record in /etc/passwd takes time, more than something with a B-tree index or whatever. Hence the flatfile database /etc/passwd (optionally) being able to be built into a Berkeley DB file, among other complications such as LDAP. So flatfiles can work, but may not scale. If your data does scale, then it may be too slow, fail, or you'll have to implement something more efficient into the workflow. (Berkeley DB can also fail, as Subversion discovered prior to moving to a filesystem-based database, probably a case of needing to use the right tool for the job. One story I heard was that the BDB folks said something like "don't use BDB like that".)
Another problem with flat files involves atomicity. flock(2) applies to only the whole file, and could be ignored by a poorly written tool; a database usually will present an API that allows multiple actors to play together, and may have row-level locking so that a whole table or file need not be locked for a row update. You could write a library that offers row-level locking to flat files, but at that point why not just use sqlite, which has lots of libraries and other such goodies already built around it? If it's only you editing the files and mostly not reading at the same time, then atomicity may not matter, or you can rename(2) the (hopefully small) file into place after rewriting it.
There's not really a standard for flatfiles (or too many: JSON, recfile, wikis, other? Wikipedia lists Berkeley DB in the flatfile article, which is a bit strange to me, as I put Berkeley DB over closer to sqlite: database file, API to access it) and you may end up with something homegrown; with a database you can probably find or hire people who know at least a little about sqlite or SQL, and it's probably not too hard to crosstrain into Postgres or some other flavor of database or SQL. Documentation, training, backups, integrity, software library support, redundancy, scalability—these all can point to databases being a good default choice.
https://www.gnu.org/software/recutils/
Most of my data is not in a database, as it's mainly text and music blobs in filesystems, mail in mailbox files, and so forth. But I don't have much data nor much need to organize or search it using a database (at the cost of sometimes having to poke around with grep or `git grep` and maybe to not find things). This blog was at one point backed by sqlite, but I rewrote it into a different tool that injects a file into the right place and hopefully updates all the different index and feed files aright. There can be tradeoffs as to where the complexity ends up and how much trouble there is when something fails, what about backups, etc. The OpenBSD man pages also got rid of a sqlite dependency: too much code to review, and whatever apropos(1) and makewhatis(8) do could be handled using base libraries. So there are cases where you can get away with something custom. If it makes sense. Maybe do some planning in advance of implementing whatever, put together a prototype to see if the idea flies, and consider the tradeoffs involved? Does your homegrown filesystem thing use more or less energy than a database, is the custom implementation worth it, etc.
$ apropos ^ | wc -l 14143 $ time apropos ^ | wc -l 14143 0m00.07s real 0m00.02s user 0m00.06s system $ time apropos ^ | wc -l 14143 0m00.07s real 0m00.05s user 0m00.04s system $ time apropos ^ | wc -l 14143 0m00.07s real 0m00.02s user 0m00.05s system
That's on a pretty fast system with a SSD, and memory caching is involved. I'm guessing it isn't terribly slow on older CPU or when spinny metal disks are involved (and if you're running OpenBSD on such you know what you signed up for), and most searches aren't going to be looking for everything. On the other hand, the code is very specialized to OpenBSD man pages, so there's no generic SQL interface. If queries via SQL are going to be a thing, then you'd probably want to design for that. Not that I've kept up on what data query folks are using these days—the bill for Cognos looked pretty spendy and they hired someone dedicated to that role the last time I was anywhere near such work.