I wrote an internal tool which is similar but for a more complicated use-case: synchronizing a hierarchical database into a hierarchy of JSON files, in a git repo, which are then automatically or manually committed by the user.
Our internal database is deeply hierarchical and also has a concept of a "folder" record, which users can create and nest. Users have complained for years that there was no change tracking. I surveyed some of the tools mentioned here (and other commercial ones), but they all just flattened the database into JSON, XML, or raw SQL files, which made it very difficult for users to see the changes.
We also thought about getting rid of the database and running off of JSON files, but that would have required completely rewriting our already brittle internal ORM, which uses raw SQL in too many places for performance, and it would have been a breaking change. I have a heavy dislike for rewriting or throwing away whole modules, and also for breaking changes, having been personally bitten by both many times in the past. Due to other constraints on my team this was unacceptable. It's still on the table, just not for now.
Instead I wrote a layer which translates back and forth between the ORM (or the raw database) and the JSON files in the file system. This is an addon to the application and is written in such a way that it is mostly transparent to the users. Each change to a record in the database is translated into a write to the corresponding JSON file on disk. Users can commit and push at their leisure. Users can also switch branches or reset to an earlier version, and then request that the database be updated from the JSON files.
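A minimal sketch of the idea in Node, assuming an ORM that exposes an after-save hook (all names here are hypothetical, not the actual internal tool):

const fs = require("fs");
const path = require("path");

const REPO_DIR = "./db-mirror"; // working tree of the git repo

// Mirror each saved record into a JSON file whose path follows the
// record's folder hierarchy, e.g. invoices/2020/rec-42 becomes
// db-mirror/invoices/2020/rec-42.json
function onRecordSaved(record) {
  const file = path.join(REPO_DIR, ...record.folderPath, record.id + ".json");
  fs.mkdirSync(path.dirname(file), { recursive: true });
  fs.writeFileSync(file, JSON.stringify(record.fields, null, 2) + "\n");
}

// The reverse direction (after a git checkout or reset) walks the tree
// and upserts each file back into the database through the ORM.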
I had thought many times about open sourcing the basic part of this addon. But I was never sure how common this use-case is. Also, it's actually pretty trivial to write the base part. The hard parts were gluing it to the existing application/ORM, and working around the many edge cases and bugs in our existing application.
So, I wonder, how common is this use case (version control of an existing hierarchical SQL database in a user friendly way) in other organizations?
I would love to see this open sourced. I love the marriage of a database, git, and JSON files!
Could use an example of the database file and diffs.
I have vague recollections of a tool that was similarly supposed to make JSON files diffable. IIRC it procedurally built up the object, so {"foo": {"bar": "baz", "quux": [1, 2, 3]}} would be:
foo = {}
foo.bar = "baz"
foo.quux = []
foo.quux[0] = 1
...
This is called gron; I prefer it for reading JSON with a lot of nesting.
https://github.com/tomnomnom/gron
This is really cool. Gives you much of what jq does without having to remember jq syntax.
This is amazing!
The example was a good idea, thanks! I added it.
Augtool from the Augeas project does this too (as well as the already-mentioned gron).
Example:

$ augtool -r . --noautoload --transform "Json.lns incl /sirdb-example.json" <<EOF
> print /files/sirdb-example.json/
> EOF
/files/sirdb-example.json
/files/sirdb-example.json/dict
/files/sirdb-example.json/dict/entry[1] = "_owner"
/files/sirdb-example.json/dict/entry[1]/string = "nouser"
/files/sirdb-example.json/dict/entry[2] = "username"
/files/sirdb-example.json/dict/entry[2]/string = "nouser"
/files/sirdb-example.json/dict/entry[3] = "email"
... etc ...
I find it unfortunate that augeas is so magical. Or rather, the design of its interface is very surprising.
What's up with requiring "--noautoload"? What does "-r ." do? And the transform invocation starts with the seemingly magical "Json.lns" (where does it come from?) and appears to refer to a file at the root of the filesystem hierarchy.
I'd probably use augtool more often myself, but its UI is _seriously_ inscrutable.
I agree, fwiw.
I'm writing a little program to make this whole lot easier.
Answers: by default, augtool loads all the lenses and scans the files they handle under the default root (mostly things in /etc); `--noautoload` prevents that.
The -r is `--root`: it specifies the root as the current directory rather than /, which is why the file shows up under /files/sirdb-example.json.
As for Json.lns: that's the JSON lens shipped with Augeas, and the --transform flag tells augtool to parse the included file with it.
Amusingly, this is almost identical to how HN’s database works. Or worked.
It wasn’t json, but it probably could’ve been.
I've definitely also written a big chunk of something like this (including JSON) for an internal tool.
Just didn't get to realising my folders were generic tables.
I'd go for YAML instead of JSON, or "JSON with trailing commas", or else the diffs become messy.
YAML's biggest advantage over JSON: records can be appended at the end of an existing file. You can't do that with JSON objects, arrays, etc. because of the closing bracket at the end.
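A small illustration of the difference in Node (file names assumed):

const fs = require("fs");

// YAML: a top-level sequence grows by appending lines at the end.
fs.appendFileSync("records.yaml", "- id: 3\n  name: carol\n");

// JSON: the closing "]" means parsing and rewriting the whole file.
const records = JSON.parse(fs.readFileSync("records.json", "utf8"));
records.push({ id: 3, name: "carol" });
fs.writeFileSync("records.json", JSON.stringify(records, null, 2));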
I generate a bunch of JSON data from my machine learning runs (a bunch of outputs for analysis), so this fits my use case perfectly (inspecting changes across different run settings).
Although you already got my upvote for the use of _yer_... (I use it all the time and rarely see anyone using it, yagetme).
I've often looked for a simple human readable file-based DB-like system like this. Are there other competing packages like this?
I feel like many static site generators have more-or-less filesystem-as-database. Some are more specialized than others. Many SSGs have standardized on front matter: a block of YAML metadata delimited by `---` lines, followed by the content, and they expose a query interface over that metadata. But most static site generators don't have a create/update API, just a read API.
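The usual shape of such a file (field names illustrative):

---
title: Hello world
date: 2020-11-07
tags: [databases, git]
---

The post content goes here.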
A reason you don't see much in the "tree of simple files as my DB" space is that the whole point of it is to be trivially simple to understand, and thus, to implement.
Now, with file DBs, it's actually quite important to use atomic writes, so that a crash or concurrent operations don't leave corrupt data behind. For example, the package posted above can corrupt data if there are concurrent writers to the same data file, because all the writes are non-atomic fs.writeFileSync() calls. What this database should do instead is write to a tempfile, and then rename the tempfile over the destination. That way, you get simple last-write-wins semantics with no possibility of creating invalid JSON.
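A minimal sketch of that tempfile-and-rename pattern in Node (the helper name is made up):

const fs = require("fs");
const path = require("path");
const crypto = require("crypto");

// Write to a unique tempfile in the same directory (rename is only
// atomic within a single filesystem), then rename it over the
// destination. Readers see either the old or the new content, never
// a torn write.
function writeFileAtomic(file, data) {
  const tmp = path.join(
    path.dirname(file),
    "." + path.basename(file) + "." + crypto.randomBytes(6).toString("hex")
  );
  fs.writeFileSync(tmp, data);
  fs.renameSync(tmp, file); // POSIX rename(2) replaces the target atomically
}

writeFileAtomic("users/alice.json", JSON.stringify({ name: "alice" }, null, 2) + "\n");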
This is something I wrote for a small electron application that needed some simple database functionality:
https://www.npmjs.com/package/minim-json-db
Like the OP's project, it stores data locally in .json files and uses a simple MongoDB-inspired API.
Thanks! I'll add it to the related projects.
There's always recutils [0], which uses a plain-text database format stored in files called recfiles.
[0]
https://labs.tomasino.org/gnu-recutils/
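For flavor, a recfile is just records of "Field: value" lines separated by blank lines, which recsel can then query (field names here are made up):

Name: alice
Email: alice@example.com

Name: bob
Email: bob@example.com

$ recsel -e "Name = 'alice'" users.rec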
I should add that I was inspired by recutils and considered using it before making this. The appeal of using common Unix tools to play with a database is a motivation for me
I just had a quick search; it didn't turn up much, but there is this:
https://github.com/pksunkara/nosqlite
I could add a section for it.
This is sort of in the space:
https://github.com/msiemens/tinydb
dBASE is largely text-based: git-diffable, and therefore versionable.
You can get this behavior with a SQL database like MySQL that has a text dump format. I've used this quite effectively to keep relatively small (but important) databases in a git repo.
I can definitely appreciate the advantages of an export format that is explicitly designed to support this (e.g. outputting each field on its own line, perhaps with an explicit label in front of it), but even the out-of-the-box format can often be quite effective.
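For example (database name assumed): with one INSERT per row, diffs stay local to the changed rows, and skipping the dump timestamp avoids a spurious diff on every export:

mysqldump --skip-extended-insert --skip-dump-date yourdb > yourdb.sql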
What's the use case? Certainly performance will suffer at scale, so what's the usable limit before the disadvantages outweigh the advantages of being able to use git?
I don't know about performance benchmarks against other DBs, or about scale, but I ran read and write benchmarks, saw thousands of ops/s, and felt satisfied enough. It won't be faster than the filesystem unless you mount the DB directory on a RAM disk.
For my use case, speed was not the most important thing; instead it was being able to use common tools on the DB, being human-readable JSON, and being git diff-able. Also, I wanted small resource usage and a small library size.
I personally feel like the performance of DBs for smaller projects is a bit of a stale topic.
With SSDs at 2 GB/s and over, that's on par with early server RAM, and there were companies who were able to serve pretty significant loads with 2 GB/s RAM.
Slightly off topic - can we use git-diff-able databases for faster syncing of clients with servers?
Or is there no practical difference?
We can return just the most recently updated documents by maintaining a lastUpdated timestamp on client and server docs, but that works at the document level. With a diffable format, only the fields that have changed would be sent to the client.
A much easier solution that gets you 90% of the way is to just use a binary patcher (like bsdiff) with whatever database format you're already using.
bsdiff doesn't scale to gigabytes; vcdiff or rsync will work better.
I'm not sure if it would be faster or use less data, but this is certainly diffable at the document and field level. Each field gets its own line, at least.
Never trust a clock.
Yeah, the wall clock isn't reliable. But if the server could send a global version / logical clock, then the client can say "this is the latest version I've got", which is essentially what TCP, git, etc. do.
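A minimal sketch of that shape; the /changes endpoint and applyChange helper are hypothetical:

// Client persists the last version it has seen; the server replies
// with everything newer plus the new high-water mark.
async function pull(lastVersion) {
  const res = await fetch("/changes?since=" + lastVersion);
  const { version, changes } = await res.json();
  for (const change of changes) applyChange(change); // hypothetical helper
  return version; // persist this as the new lastVersion
}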
How about a version chain?
I wrote an NPM package to convert JSON objects to/from a dotted format. It doesn't (currently) sort the flattened output, but that would be easy to add.
https://www.npmjs.com/package/flattenjs
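A generic sketch of the flatten direction (not necessarily this package's exact API):

// {"foo": {"bar": "baz"}} becomes {"foo.bar": "baz"}; arrays flatten
// through their indices.
function flatten(obj, prefix = "", out = {}) {
  for (const [key, value] of Object.entries(obj)) {
    const dotted = prefix ? prefix + "." + key : key;
    if (value !== null && typeof value === "object") flatten(value, dotted, out);
    else out[dotted] = value;
  }
  return out;
}

console.log(flatten({ foo: { bar: "baz", quux: [1, 2] } }));
// { 'foo.bar': 'baz', 'foo.quux.0': 1, 'foo.quux.1': 2 }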
Is SQLite not diffable?
I wrote a tool that dumps a SQLite database out to disk as diffable JSON, precisely so you can easily record changes made to the database over time:
https://github.com/simonw/sqlite-diffable
I have a demo of it running against my blog's database here:
https://github.com/simonw/simonwillisonblog-backup
Thanks for this! I am trying to integrate something similar into my custom CMS.
I would export the DB and use that for your diff/version control:
sqlite3 yourdb.sqlite .dump
If you really want to track the binary, e.g. inside git, but still see textual diffs, you would put this into your .gitconfig:

[diff "sqlite3"]
    textconv = echo .dump | sqlite3

(git appends the file path to the textconv command, so this pipes .dump into sqlite3 run against that file) and then map the files to the driver in .gitattributes, e.g. `*.sqlite diff=sqlite3`.
To expand on this: you can now even export to markdown and basically get a nice overview of the database. I use this in my notes, which are all written in markdown, because markdown tables are painful to organise by hand.
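For example (database and table names assumed), using the sqlite3 CLI's markdown output mode:

sqlite3 yourdb.sqlite ".mode markdown" "SELECT * FROM notes LIMIT 10;"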
SQLite database files themselves are binary, which typically isn't diff-able with standard tools, so (e.g.) git wouldn't show you anything useful unless you configure a special tool for diffing sqlite files.
And in fact, there is a sqldiff tool for diffing sqlite databases:
https://sqlite.org/sqldiff.html
But it would still be a bummer to store them in git, because git won't efficiently track changes between sqlite databases; it'll just add the whole binary blob on every commit. Oof.
This might be a silly question, but is there a reason why git can't also diff binary files? If I have a huge binary file and I change a few bytes in the middle, what's stopping Git just checking that in as a diff? I don't see what's so special about text...
[edit] just did some research and it looks like Git _will_ store the delta of a binary file in its packfiles, just like a text file. The question is just how delta-able sqlite binary files are.
Diffing is about more than just efficient storage though, it's also about showing a history of human-readable changes (which obv you don't get with a binary format).
That's a good point, thank you
I know Drew from Dropbox implemented a binary diff for his first version (on a bus), but I don't know the details. I don't know if git can do that somehow... but I suppose you could always MIME-encode it (like email base64 encoding).
AFAIR SQLite files are normally in append mode.
Wow, this was something I was looking for... Add a "hire me" button to your README.md so you can get hired using wannahireme.com. Full disclosure: I made the site.