I deal a lot with large-ish binary files in my work. Typically these
are single-file databases in the style of SQLite3. These aren't ideal
for putting in Git; Git doesn't scale well to large repository sizes.
People always seem to try framing this as not a limitation, but it is.
I've taken to using Boar, Restic, and finally Duplicacy, all
deduplicating backup software packages. At some point someone
mentioned that it would be awesome to have something like Git, but
with the packfile format replaced by deduplicating storage. And so I
got started scratching my own itch. You can find it on my GitHub here:
https://github.com/akbarnes/dupver
Dupver is a minimalist deduplicating version control system in Go
based on the Restic chunking library. It is most similar to the binary
version control system Boar:
https://bitbucket.org/mats_ekberg/boar/wiki/Home
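To give a flavor of what the chunking library does, here's a minimal
sketch (not Dupver's actual storage code) that splits an input stream
into content-defined chunks with github.com/restic/chunker and
content-addresses them by SHA-256. The fixed polynomial and the
snapshot.tar filename are placeholders.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"io"
	"os"

	"github.com/restic/chunker"
)

func main() {
	f, err := os.Open("snapshot.tar") // any input stream works
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// A real repository would generate this once with
	// chunker.RandomPolynomial() and keep it in its config.
	pol := chunker.Pol(0x3DA3358B4DC173)

	c := chunker.New(f, pol)
	buf := make([]byte, chunker.MaxSize)

	seen := map[string]bool{}
	for {
		chunk, err := c.Next(buf)
		if err == io.EOF {
			break
		}
		if err != nil {
			panic(err)
		}
		// Content-addressed storage: identical chunks hash to the
		// same ID, so they only get stored once.
		id := fmt.Sprintf("%x", sha256.Sum256(chunk.Data))
		if !seen[id] {
			seen[id] = true
			fmt.Printf("new chunk %s (%d bytes)\n", id[:16], chunk.Length)
		}
	}
}
```

Because chunk boundaries are derived from the content itself, an edit
in the middle of a large file only changes the chunks it touches;
everything else hashes to the same IDs and is never stored twice.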
Dupver does not track files; rather, it stores snapshots, more like a
backup program. Rather than traverse directories itself, Dupver uses
an (uncompressed) tar file as input. Note that only tar files are
accepted as input, as Dupver relies on the tar container to provide
the list of files in the snapshot and to store metadata such as file
modification times and permissions. Dupver uses a centralized
repository to take advantage of deduplication between working
directories. This means that Dupver working directories can also be
Git repositories or subdirectories of Git repositories. I mainly use
it for version control of databases, but it can also be used for
sampled data.
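As a sketch of what the tar container buys you, the standard
library's archive/tar reader hands back the file list and per-file
metadata straight from the stream, so the tool never has to walk the
working directory itself (snapshot.tar is again just an example
filename):

```go
package main

import (
	"archive/tar"
	"fmt"
	"io"
	"os"
)

func main() {
	f, err := os.Open("snapshot.tar")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	tr := tar.NewReader(f)
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			break
		}
		if err != nil {
			panic(err)
		}
		// Each header carries the name, size, permissions, and
		// modification time for one entry in the snapshot.
		fmt.Printf("%s %10d %s %s\n",
			os.FileMode(hdr.Mode), hdr.Size,
			hdr.ModTime.Format("2006-01-02 15:04"), hdr.Name)
	}
}
```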
Art, 2020-10-24 07:41
In response to an HN post, some thoughts about the future of version
control. The full discussion is here:
https://news.ycombinator.com/item?id=25535844
The state-of-the-art for backup is deduplicating software (Borg,
Restic, Duplicacy). Gripes about Git's UI choices aside, Git was
designed around human-readable text files and just doesn't do large
binary files well. Sure, there's Git-LFS, but it sucks. The future of
version control will:
* Make use of deduplication to handle large binary files
* Natively support remotes via cloud storage
* Not keep state in the working directory, so that projects can live in a Dropbox/OneDrive/iCloud folder without corrupting the repo
* Be truly cross-platform with minimal POSIX dependencies. I love Linux, but I'm a practicing engineer, and the reality is that engineering software is a market where traditional Windows desktop software still rules.

Another thought I've been having for some time is whether I could
have gotten away with file-level deduplication, like Boar (or Git,
IIRC) does, and dropped compression. This would probably result in
significant simplification, particularly for copying between repos.
For most users this wouldn't impact disk space usage much, as the
bulk of files already have compression built in, and the trend seems
to be increasingly to adopt compression in new file formats. This
includes:
* Audio/image/video files with (usually) lossy compression. This surprisingly (to me) also includes raster image editor file formats such as Paint.net's .pdn, which wraps everything in a gzip stream.
* MS Office documents, structured as a hierarchy of zipped .xml files. More recently, this approach also includes Matlab's .slx Simulink file format and .mlx notebook format.

The gotcha is that this is an 80% solution.
There are still plenty of file formats that are uncompressed text,
even newer ones such as JSON/YAML/TOML, and a number of uncompressed
binary formats such as MessagePack, though most of those tend to be
some sort of database, such as the Geodatabase .gdb format (which is
based on SQLite3) or PowerWorld's .pwb format. There is also the
corner case of metadata in media files, such as EXIF tags, which if
modified would cause the whole file contents to be stored again. So
I'm sticking with chunking for the time being.
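To make the EXIF point concrete, here's a toy comparison (not Dupver
code) of what the two schemes would re-store after a small in-place
metadata edit. Fixed-size chunks stand in for content-defined
chunking just to keep the sketch short.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
	"math/rand"
)

const chunkSize = 1 << 20 // 1 MiB

// chunkIDs hashes fixed-size slices of data and returns the set of
// chunk IDs, standing in for a content-addressed chunk store.
func chunkIDs(data []byte) map[[32]byte]bool {
	ids := map[[32]byte]bool{}
	for off := 0; off < len(data); off += chunkSize {
		end := off + chunkSize
		if end > len(data) {
			end = len(data)
		}
		ids[sha256.Sum256(data[off:end])] = true
	}
	return ids
}

func main() {
	// A 64 MiB stand-in for a photo or video file.
	orig := make([]byte, 64<<20)
	rand.Read(orig)

	// Edit a small metadata header, leaving the payload untouched.
	edited := bytes.Clone(orig)
	copy(edited, []byte("edited-exif-tags"))

	// File-level dedup: the whole-file hash changes, so the entire
	// file would be stored a second time.
	fmt.Println("whole-file hashes equal:",
		sha256.Sum256(orig) == sha256.Sum256(edited))

	// Chunk-level dedup: only chunks overlapping the edit are new.
	before := chunkIDs(orig)
	newChunks := 0
	for id := range chunkIDs(edited) {
		if !before[id] {
			newChunks++
		}
	}
	fmt.Printf("new chunks after edit: %d of %d\n",
		newChunks, len(before))
}
```

With file-level dedup the whole-file hash changes and all 64 MiB get
stored again; with chunking only the one chunk overlapping the edit
is new. Content-defined chunking goes further and also survives edits
that insert or delete bytes and shift everything after them.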