I wrote an internal tool which is similar but for a more complicated use-case: synchronizing a hierarchical database into a hierarchy of JSON files, in a git repo, which are then automatically or manually committed by the user.
Our internal database is deeply hierarchical and also has a concept of a "folder" record, which users can create and nest. Users have complained for years that there was no change tracking. I surveyed some of the tools mentioned here (and other commercial ones), but they all just flattened the database into JSON, XML, or raw SQL files, which made it very difficult for users to see the changes.
We also thought about getting rid of the database and running off of JSON files, but that would have required completely rewriting our already brittle internal ORM, which uses raw SQL in too many places for performance, and it would have been a breaking change. I have a heavy dislike for rewriting or throwing away whole modules, and also for breaking changes, having been personally bitten by both many times in the past. Due to other constraints on my team this was unacceptable. It's still on the table, just not for now.
Instead I wrote a layer which translates back and forth between the ORM (or the raw database) and the JSON files in the file system. This is an addon to the application and is written in such a way that it is mostly transparent to the users. Each change to a record in the database is translated into a write to the corresponding JSON file on disk. Users can commit and push at their leisure. Users can also switch branches or reset to an earlier version, and then request that the database be updated from the JSON files.
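A minimal sketch of the idea in Node, assuming an ORM that exposes an after-save hook (all names here are hypothetical, not the actual internal tool):

const fs = require("fs");
const path = require("path");

const REPO_DIR = "./db-mirror"; // working tree of the git repo

// Mirror each saved record into a JSON file whose path follows the
// record's folder hierarchy, e.g. invoices/2020/rec-42 becomes
// db-mirror/invoices/2020/rec-42.json
function onRecordSaved(record) {
  const file = path.join(REPO_DIR, ...record.folderPath, record.id + ".json");
  fs.mkdirSync(path.dirname(file), { recursive: true });
  fs.writeFileSync(file, JSON.stringify(record.fields, null, 2) + "\n");
}

// The reverse direction (after a git checkout or reset) walks the tree
// and upserts each file back into the database through the ORM.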
I had thought many times about open sourcing the basic part of this addon. But I was never sure how common this use-case is. Also, it's actually pretty trivial to write the base part. The hard parts were gluing it to the existing application/ORM, and working around the many edge cases and bugs in our existing application.
So, I wonder, how common is this use case (version control of an existing hierarchical SQL database in a user friendly way) in other organizations?
I would love to see this open sourced. I love the marriage of a database, git, and JSON files!
Could use an example of the database file and diffs.
I have vague recollections of a tool that was similarly supposed to make JSON files diffable. IIRC it procedurally built up the object, so {"foo": {"bar": "baz", "quux": [1, 2, 3]}} would be:
foo = {}
foo.bar = "baz"
foo.quux = []
foo.quux[0] = 1
...
This is called gron; I prefer it for reading JSON with a lot of nesting.
https://github.com/tomnomnom/gron
This is really cool. Gives you much of what jq does without having to remember jq syntax.
This is amazing!
The example was a good idea, thanks! I added it.
Augtool from the Augeas project does this too (as well as the already-mentioned gron).
Example:

$ augtool -r . --noautoload --transform "Json.lns incl /sirdb-example.json" <<EOF
> print /files/sirdb-example.json/
> EOF
/files/sirdb-example.json
/files/sirdb-example.json/dict
/files/sirdb-example.json/dict/entry[1] = "_owner"
/files/sirdb-example.json/dict/entry[1]/string = "nouser"
/files/sirdb-example.json/dict/entry[2] = "username"
/files/sirdb-example.json/dict/entry[2]/string = "nouser"
/files/sirdb-example.json/dict/entry[3] = "email"
... etc ...
I find it unfortunate that augeas is so magical. Or rather, the design of its interface is very surprising.
What's up with requiring "--noautoload"? What does "-r ." do? And the transform invocation starts with the seemingly magical "Json.lns" (where does it come from?) and appears to refer to a file at the root of the filesystem hierarchy.
I'd probably use augtool more often myself, but its UI is _seriously_ inscrutable.
I agree, fwiw.
I'm writing a little program to make this whole lot easier.
Answers: by default, augtool loads all the lenses and scans the files they handle under the default root (mostly things in /etc); `--noautoload` prevents that.
The -r is `--root`: it specifies the root as the current directory rather than /, which is why the file shows up under /files/sirdb-example.json.
As for Json.lns: that's the JSON lens shipped with Augeas, and the --transform flag tells augtool to parse the included file with it.
Amusingly, this is almost identical to how HN’s database works. Or worked.
It wasn’t json, but it probably could’ve been.
I've definitely also written a big chunk of something like this (including JSON) for an internal tool.
Just didn't get to realising my folders were generic tables.
I'd go for YAML instead of JSON, or "JSON with trailing commas", or else the diffs become messy.
YAML's biggest advantage over JSON: records can be appended at the end of an existing file. You can't do that with JSON objects, arrays, etc. because of the closing bracket at the end.
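A small illustration of the difference in Node (file names assumed):

const fs = require("fs");

// YAML: a top-level sequence grows by appending lines at the end.
fs.appendFileSync("records.yaml", "- id: 3\n  name: carol\n");

// JSON: the closing "]" means parsing and rewriting the whole file.
const records = JSON.parse(fs.readFileSync("records.json", "utf8"));
records.push({ id: 3, name: "carol" });
fs.writeFileSync("records.json", JSON.stringify(records, null, 2));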
I generate a bunch of JSON data from my machine learning runs (a bunch of outputs for analysis), so this fits my use case perfectly (inspecting changes across different run settings).
Although you already got my upvote for the use of _yer_... (I use it all the time and rarely see anyone using it, yagetme).
I've often looked for a simple human readable file-based DB-like system like this. Are there other competing packages like this?
I feel like many static site generators have more-or-less filesystem-as-database. Some are more specialized than others. Many SSGs have standardized on front matter: a block of YAML metadata delimited by `---` lines, followed by the content, and they expose a query interface over that metadata. But most static site generators don't have a create/update API, just a read API.
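The usual shape of such a file (field names illustrative):

---
title: Hello world
date: 2020-11-07
tags: [databases, git]
---

The post content goes here.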
A reason you don't see much in the "tree of simple files as my DB" space is that the whole point of it is to be trivially simple to understand, and thus, to implement.
Now, with file DBs, it's actually quite important to use atomic writes, so that a crash or concurrent operations don't leave corrupt data behind. For example, the package posted above can corrupt data if there are concurrent writers to the same data file, because all the writes are non-atomic fs.writeFileSync() calls. What this database should do instead is write to a tempfile, and then rename the tempfile over the destination. That way, you get simple last-write-wins semantics with no possibility of creating invalid JSON.
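A minimal sketch of that tempfile-and-rename pattern in Node (the helper name is made up):

const fs = require("fs");
const path = require("path");
const crypto = require("crypto");

// Write to a unique tempfile in the same directory (rename is only
// atomic within a single filesystem), then rename it over the
// destination. Readers see either the old or the new content, never
// a torn write.
function writeFileAtomic(file, data) {
  const tmp = path.join(
    path.dirname(file),
    "." + path.basename(file) + "." + crypto.randomBytes(6).toString("hex")
  );
  fs.writeFileSync(tmp, data);
  fs.renameSync(tmp, file); // POSIX rename(2) replaces the target atomically
}

writeFileAtomic("users/alice.json", JSON.stringify({ name: "alice" }, null, 2) + "\n");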
This is something I wrote for a small electron application that needed some simple database functionality:
https://www.npmjs.com/package/minim-json-db
Like the OP's project, it stores data locally in .json files and uses a simple MongoDB-inspired API.
Thanks! I'll add it to the related projects.
There's always recutils [0], which uses a plain-text database format stored in files called recfiles.
[0]
https://labs.tomasino.org/gnu-recutils/
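For flavor, a recfile is just records of "Field: value" lines separated by blank lines, which recsel can then query (field names here are made up):

Name: alice
Email: alice@example.com

Name: bob
Email: bob@example.com

$ recsel -e "Name = 'alice'" users.rec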
I should add that I was inspired by recutils and considered using it before making this. The appeal of using common Unix tools to play with a database is a motivation for me
I just had a quick search; it didn't turn up much, but there is this:
https://github.com/pksunkara/nosqlite
I could add a section for it.
This is sort of in the space:
https://github.com/msiemens/tinydb
dBASE is largely text-based: git-diffable, and therefore versionable.
You can get this behavior with a SQL database like MySQL that has a text dump format. I've used this quite effectively to keep relatively small (but important) databases in a git repo.
I can definitely appreciate the advantages of an export format that is explicitly designed to support this (e.g. outputting each field on its own line, perhaps with an explicit label in front of it), but even the out-of-the-box format can often be quite effective.
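For example (database name assumed): with one INSERT per row, diffs stay local to the changed rows, and skipping the dump timestamp avoids a spurious diff on every export:

mysqldump --skip-extended-insert --skip-dump-date yourdb > yourdb.sql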
What's the use case? Certainly performance will suffer at scale, so what's the usable limit before the disadvantages outweigh the advantages of being able to use git?
I don't know about performance benchmarks against other DBs, or about scale, but I ran read and write benchmarks, saw thousands of ops/s, and felt satisfied enough. It won't be faster than the filesystem unless you mount the DB directory on a RAM disk.
For my use case, speed was not the most important thing; instead it was being able to use common tools on the DB, being human-readable JSON, and being git diff-able. Also, I wanted small resource usage and a small library size.
I personally feel like the performance of DBs for smaller projects is a bit of a stale topic.
With SSDs at 2 GB/s and over, that's on par with early server RAM, and there were companies who were able to serve pretty significant loads with 2 GB/s RAM.
Slightly off topic - can we use git-diff-able databases for faster syncing of clients with servers?
Or is there no practical difference?
We can return just the most recently updated documents by maintaining a lastUpdated timestamp on client and server docs, but that works at the document level. With a diffable format, only the fields that have changed would be sent to the client.
A much easier solution that gets you 90% of the way is to just use a binary patcher (like bsdiff) with whatever database format you're already using.
bsdiff doesn't scale to gigabytes; vcdiff or rsync will work better.
I'm not sure if it would be faster or use less data, but this is certainly diffable at the document and field level. Each field gets its own line, at least.
Never trust a clock.
Yeah, the wall clock isn't reliable. But if the server could send a global version / logical clock, then the client can say "this is the latest version I've got", which is essentially what TCP, git, etc. do.
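A minimal sketch of that shape; the /changes endpoint and applyChange helper are hypothetical:

// Client persists the last version it has seen; the server replies
// with everything newer plus the new high-water mark.
async function pull(lastVersion) {
  const res = await fetch("/changes?since=" + lastVersion);
  const { version, changes } = await res.json();
  for (const change of changes) applyChange(change); // hypothetical helper
  return version; // persist this as the new lastVersion
}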
How about a version chain?
I wrote an NPM package to convert JSON objects to/from a dotted format. It doesn't (currently) sort the flattened output, but that would be easy to add.
https://www.npmjs.com/package/flattenjs
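A generic sketch of the flatten direction (not necessarily this package's exact API):

// {"foo": {"bar": "baz"}} becomes {"foo.bar": "baz"}; arrays flatten
// through their indices.
function flatten(obj, prefix = "", out = {}) {
  for (const [key, value] of Object.entries(obj)) {
    const dotted = prefix ? prefix + "." + key : key;
    if (value !== null && typeof value === "object") flatten(value, dotted, out);
    else out[dotted] = value;
  }
  return out;
}

console.log(flatten({ foo: { bar: "baz", quux: [1, 2] } }));
// { 'foo.bar': 'baz', 'foo.quux.0': 1, 'foo.quux.1': 2 }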
Is SQLite not diffable?
I wrote a tool that dumps a SQLite database out to disk as diffable JSON, precisely so you can easily record changes made to the database over time:
https://github.com/simonw/sqlite-diffable
I have a demo of it running against my blog's database here:
https://github.com/simonw/simonwillisonblog-backup
Thanks for this! I am trying to integrate something similar into my custom CMS.
I would export the DB and use that for your diff/version control:
sqlite3 yourdb.sqlite .dump
If you really want to track the binary, e.g. inside git, but still see textual diffs, you would put this into your .gitconfig:

[diff "sqlite3"]
    textconv = echo .dump | sqlite3

(git appends the file path to the textconv command, so this pipes .dump into sqlite3 run against that file) and then map the files to the driver in .gitattributes, e.g. `*.sqlite diff=sqlite3`.
To expand on this: you can now even export to markdown and basically get a nice overview of the database. I use this in my notes, which are all written in markdown, because markdown tables are painful to organise by hand.
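For example (database and table names assumed), using the sqlite3 CLI's markdown output mode:

sqlite3 yourdb.sqlite ".mode markdown" "SELECT * FROM notes LIMIT 10;"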
SQLite database files themselves are binary, which typically isn't diff-able with standard tools, so (e.g.) git wouldn't show you anything useful unless you configure a special tool for diffing sqlite files.
And in fact, there is a sqldiff tool for diffing sqlite databases:
https://sqlite.org/sqldiff.html
But it would still be a bummer to store them in git, because git won't efficiently track changes between sqlite databases; it'll just add the whole binary blob on every commit. Oof.
This might be a silly question, but is there a reason why git can't also diff binary files? If I have a huge binary file and I change a few bytes in the middle, what's stopping Git just checking that in as a diff? I don't see what's so special about text...
[edit] just did some research and it looks like Git _will_ store the delta of a binary file in its packfiles, just like a text file. The question is just how delta-able sqlite binary files are.
Diffing is about more than just efficient storage though, it's also about showing a history of human-readable changes (which obv you don't get with a binary format).
That's a good point, thank you
I know Drew from Dropbox implemented a binary diff for his first version (on a bus), but I don't know the details. I don't know if git can do that somehow... but I suppose you could always MIME-encode it (like email base64 encoding).
AFAIR SQLite files are normally in append mode.
Wow, this was something I was looking for... Add a "hire me" button to your README.md so you can get hired using wannahireme.com. Full disclosure: I made the site.