💾 Archived View for soviet.circumlunar.space › ~hektor › dupver.gmi captured on 2024-08-25 at 00:26:05. Gemini links have been rewritten to link to archived content
I deal a lot with large-ish binary files in my work. Typically these are single-file databases in the style of SQLite3. These aren't ideal for putting in Git: Git doesn't scale well to large repository sizes. People always seem to try framing this as not a limitation, but it is. I've taken to using the Boar, Restic, and finally Duplicacy deduplicating backup software packages. At some point someone mentioned that it would be awesome to have something like Git, but with the packfile format replaced by deduplicating storage. And so I got started scratching my own itch. You can find it on my GitHub here:
https://github.com/akbarnes/dupver
Dupver is a minimalist deduplicating version control system in Go based on the Restic chunking library. It is most similar to the binary version control system Boar:
https://bitbucket.org/mats_ekberg/boar/wiki/Home
Dupver does not track files; rather, it stores snapshots, more like a backup program. Rather than traverse directories itself, Dupver uses an (uncompressed) tar file as input. Note that only tar files are accepted as input, as Dupver relies on the tar container to provide the list of files in the snapshot and to store metadata such as file modification times and permissions. Dupver uses a centralized repository to take advantage of deduplication between working directories. This means that Dupver working directories can also be Git repositories or subdirectories of Git repositories. I mainly use it for version control of databases, but it can also be used for sampled data.
Art, 2020-10-24 07:41
In response to a HN post, some thoughts about the future for version control. Full discussion is here: https://news.ycombinator.com/item?id=25535844
The state-of-the-art for backup is deduplicating software (Borg, Restic, Duplicacy). Gripes about Git's UI choices aside, Git was designed around human-readable text files and just doesn't do large binary files well. Sure, there's Git-LFS, but it sucks. The future of version control will:
* Make use of deduplication to handle large binary files
* Natively support remotes via cloud storage
* Not keep state in the working directory, so that projects can live in a Dropbox/OneDrive/iCloud folder without corrupting the repo
* Be truly cross-platform with minimal POSIX dependencies. I love Linux, but I'm a practicing engineer, and the reality is that engineering software is a market where traditional Windows desktop software still rules.

Another thought I've been having for some time is whether I could have gotten away with file-level deduplication like Boar (or Git, IIRC) does and dropped compression. This would probably result in significant simplification, particularly for copying between repos. For most users this wouldn't impact disk space usage much, as the bulk of files already have compression built in, and the trend seems increasingly to be adopting compression in new file formats. This includes:
* Audio/image/video files with (usually) lossy compression. This surprisingly (to me) also includes raster image editor file formats such as Paint.net's .pdn, which wraps everything in a gzip stream.
* MS Office documents structured as a hierarchy of zipped .xml files. More recently, this category also includes Matlab's .slx Simulink file format and .mlx notebook format.

The gotcha is that this is an 80% solution. There are still plenty of file formats that are uncompressed text, even newer ones such as JSON/YAML/TOML, and a number of uncompressed binary file formats such as MessagePack, though most tend to be some sort of database, such as the geodatabase .gdb format, which is based on SQLite3, or PowerWorld's .pwb format. There is also the corner case of metadata in media files such as EXIF, which if modified would cause the whole file contents to be stored again. So I'm sticking with chunking for the time being.
The past couple weeks have brought a number of updates to dupver. The big ones are:
1. Added JSON output for commands. This will make using dupver with a GUI frontend easier. It also makes it easier to use with object shells like NuShell or PowerShell
2. Added global preferences to specify an editor, diff tool, and default repository path. This currently spawns an asynchronous process and doesn't support optional arguments, so it won't work with terminal-mode editors or on macOS
3. Fixed a bug in copy that caused it to skip the last chunk
If you want to try out the new JSON output, here are a couple of examples.
NuShell:

```
dupver -j log | from json
dupver -j log <commit_id> | from json
dupver -j status | from json
dupver -jv status | from json
```
PowerShell:

```
dupver -j log | ConvertFrom-JSON | Format-Table
dupver -j log <commit_id> | ConvertFrom-JSON | Format-Table
dupver -j status | ConvertFrom-JSON | Format-Table
dupver -jv status | ConvertFrom-JSON | Format-Table
```
A new year brings a new version of Dupver, 2.0.1!
As of 1.0.0, dupver traverses directories itself instead of reading from an intermediate tar file. This complicates the program somewhat, but is much faster when dealing with incremental updates to working directories. Some features have been stripped out, including JSON output and centralized repositories shared between projects. The loss of JSON output isn't too big a deal, as JSON is still the dominant data format within the repository, so third-party scripts can read directly from the repository rather than calling the dupver executable. Projects now have a distributed repository, Git-style. With the exception of the head commit pointer `head.json`, files are not modified inside the repository, so dupver is expected to play well with folder synchronization software.
As of 2.0.0, dupver now stores commit files in an ordered list as opposed to a dictionary. This allows for files to be read from the repository in the order that they were committed, which supports the new `repack` command. The `repack` command will create a new copy of the `trees` and `packs` subfolders with small pack files from the end of each commit consolidated. Additionally, it will not copy chunks that do not have snapshots associated with them, allowing for pruning of snapshots.
A question that has come up is whether to support special handling of archive files so that individual files within the archive could be decompressed and chunked. This will add a fair bit of complexity and require changing the repository format again, so I'm holding off for now and focusing my efforts on improving the docs, adding more unit tests, and benchmarking.