In my writing about dar[1], I recently made the point that dar is a filesystem differ and patcher.
We can exploit this property to do something really cool: build an Asynchronous[2] rsync[3]. What does that mean?
2: /asynchronous-communication/
rsync is a tool that has been in many *nix admins' toolboxes for years. Typically used over ssh, rsync will compare the state of a local directory tree (or file) to the state of a remote tree, and efficiently make the remote match the local (or vice-versa). It does this by comparing metadata on files, and sending efficient binary deltas to reflect changes.
But of course, rsync is synchronous; that is, it must have real-time direct access between the local and the remote. rsync cannot use a USB drive as transport, for instance. (OK, yes, it has a `--write-batch` option, but it is only used to duplicate a set of changes to another remote that is in an identical starting state to one that rsync can reach.)
As I said, dar is a filesystem differ. It also has binary delta capabilities. And it turns out we can use it as an asynchronous-capable rsync.
What would an asynchronous synchronization algorithm look like? I'd suggest these general steps:
1. Obtain metadata about the destination. Save this in a file and transfer it to the source machine.
2. On the source, compare the source state with the metadata about the destination. Generate a file with commands to make the destination match the source state. Send that file to the destination.
3. Apply the commands to the destination.
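Before bringing dar into it, the three steps can be sketched with nothing but `find`, `sha256sum`, `sort`, and `comm`. This is only a rough illustration: it ships whole files rather than binary deltas, it is not how dar works internally, and the function names (`make_manifest`, `make_patch`, `apply_patch`) and the patch directory layout are invented for this example. It assumes GNU coreutils and file names without newlines or backslashes.

```shell
#!/bin/sh
# Illustrative sketch only: whole files instead of binary deltas, no dar.
# Assumes GNU coreutils (sha256sum) and file names without newlines or backslashes.

# Step 1 (destination): one "checksum  ./path" line per regular file.
make_manifest() {
    # $1 = root of the tree to describe
    ( cd "$1" && find . -type f -exec sha256sum {} + ) | LC_ALL=C sort
}

# Helper: strip the checksum column from a manifest, leaving sorted paths.
manifest_paths() {
    sed 's/^[0-9a-f]* [ *]//' "$1" | LC_ALL=C sort
}

# Step 2 (source): fill a patch directory with changed files and a delete list.
make_patch() {
    # $1 = source root, $2 = destination manifest, $3 = patch directory to create
    mkdir -p "$3/files"
    make_manifest "$1" > "$3/source.manifest"
    manifest_paths "$3/source.manifest" > "$3/source.paths"
    manifest_paths "$2" > "$3/dest.paths"
    # Present on the destination but not on the source: schedule for deletion.
    LC_ALL=C comm -13 "$3/source.paths" "$3/dest.paths" > "$3/delete.list"
    # Manifest lines unique to the source are new or changed files: copy them in.
    LC_ALL=C comm -23 "$3/source.manifest" "$2" | sed 's/^[0-9a-f]* [ *]//' |
    while IFS= read -r path; do
        mkdir -p "$3/files/$(dirname "$path")"
        cp -p "$1/$path" "$3/files/$path"
    done
}

# Step 3 (destination): delete what is gone, then copy in what changed.
apply_patch() {
    # $1 = patch directory, $2 = destination root
    while IFS= read -r path; do
        rm -f "$2/$path"
    done < "$1/delete.list"
    cp -Rp "$1/files/." "$2/"
}
```

The manifest file and the patch directory are the only things that need to travel between the machines, and neither step needs the other machine to be reachable while it runs.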
Let's see how this can work with dar.
Let's say we're going to put something in `/tmp/test` on the remote. I'm going to simulate synchronizing `/usr/local/bin` from one machine to another. So:
$ cp -r /usr/local/bin /tmp/test
Here's one way to save off the metadata:
dar --create - \
    --fs-root /tmp/test \
    --delta sig \
  | dar --sequential-read \
    --ref - \
    --isolate /tmp/destmeta \
    --compression=zstd \
    --delta sig
Let's step through this. In my post about using dar for archiving[4], I used `--on-fly-isolate` to write an isolated catalog (that's basically an archive with only the metadata). The dar documentation notes that `--on-fly-isolate` can't be used with binary deltas, so we take a different approach: first we create a dar archive, then pipe it to a second dar command that extracts only the catalog (and discards the rest of the data). The second dar command reads stdin as the archive of reference (with `--ref -`), writes an isolated catalog (compressed), and includes the delta signatures.
4: https://changelog.complete.org/archives/10527-using-dar-for-data-archiving
But Dar author Denis Corbin mentioned to me that dar has a feature called snapshots. A dar snapshot is a situation where we create only an isolated catalog, and never bother to store the data. Especially when not creating delta signatures, it will be faster than the example above, because it doesn't save the content of files only to discard it. Here's how we can simplify the above example using snapshot mode:
dar --create /tmp/destmeta \
    --fs-root /tmp/test \
    --delta sig \
    --compression=zstd \
    --ref +
The `--ref +` activates the snapshot mode.
Let's inspect this:
dest$ ls -l /tmp/destmeta*
-rw-r--r-- 1 jgoerzen jgoerzen 53897 Jun 25 19:59 /tmp/destmeta.1.dar
dest$ dar -l /tmp/destmeta
[Data ][D][ EA ][FSA][Compr][S]| Permission | User | Group | Size | Date | filename
--------------------------------+------------+-------+-------+---------+-------------------------------+------------
...
[InRef][D] [-L-][-----][X] -rwxr-xr-x jgoerzen jgoerzen 5 Mio Sun Jun 25 20:00:04 2023 fspl
One thing to note here is that I used `cp -r`, not `cp -a`, so the original timestamps weren't preserved. Let's see if the right thing happens. Now I'll copy `destmeta.1.dar` to the source and we'll move on to step 2.
I copied `/usr/local/bin` to `/tmp/test` on the source machine as well. It differed from that directory on the destination, so we should see lots of changes. Let's create the update archive.
source$ dar --create /tmp/patchfile \
    --ref /tmp/destmeta \
    --fs-root /tmp/test \
    --compression=zstd
So, we create a dar archive called `/tmp/patchfile`, using the information in `/tmp/destmeta` as a reference. We base the archive on `/tmp/test` and compress with zstd. Let's look at the result:
$ ls -lh /tmp/patchfile*
-rw-r--r-- 1 jgoerzen jgoerzen 18M Jun 25 20:04 /tmp/patchfile.1.dar
$ dar -l /tmp/patchfile
--------------------------------+------------+-------+-------+---------+-------------------------------+------------
[Saved][ ] [-L-][ 78%][X] -rwxr-xr-x jgoerzen jgoerzen 12 Mio Sun Jun 25 20:01:36 2023 flrig
...
[Delta][ ] [-L-][ 74%][ ] -rwxr-xr-x jgoerzen jgoerzen 5 Mio Sun Jun 25 20:01:36 2023 fspl
...
[--- REMOVED ENTRY ----] (Sun Jun 25 20:01:36 2023) [-] doarchive
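So, in these excerpts, `flrig` is marked `[Saved]`, meaning its full content is stored in the patch; `fspl` is marked `[Delta]`, meaning only a binary delta against the destination's copy is stored; and `doarchive` is a removed entry, meaning it exists on the destination but not on the source and will be deleted when the patch is applied.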
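For a large tree it can help to tally those states rather than read the listing by eye. Here is a small sketch that counts them in whatever `dar -l` prints. The helper name `summarize_listing` is made up, and it only pattern-matches the line prefixes visible in the listing above, so other dar versions or listing options may need different patterns.

```shell
# Hypothetical helper: tally entry states in the text that `dar -l` prints.
# It matches only the line prefixes shown in this post's listing excerpt.
summarize_listing() {
    awk '
        /^\[Saved\]/           { saved++ }    # full content stored
        /^\[Delta\]/           { delta++ }    # binary delta stored
        /^\[--- REMOVED ENTRY/ { removed++ }  # recorded as deleted
        END {
            printf "saved=%d delta=%d removed=%d\n", saved, delta, removed
        }
    '
}
```

It reads from standard input, so it would be used as `dar -l /tmp/patchfile | summarize_listing`.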
We can get even more information in the XML output from dar:
<File name="flrig" size="12 Mio" stored="2 Mio" crc="4b78ea00" dirty="no" sparse="yes" delta_sig="no" patch_base_crc="" patch_result_crc="">
...
<File name="fspl" size="5 Mio" stored="1 Mio" crc="701f322e" dirty="no" sparse="no" delta_sig="no" patch_base_crc="6111a321" patch_result_crc="78199781">
...
<File name="doarchive">
<Attributes data="deleted" metadata="absent" user="" group="" permissions="" atime="" mtime="1687741296" ctime="" />
</File>
Here you can actually see that, when patching, dar encodes information to validate the result is correct. OK, let's copy this to the destination and move on to step 3.
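If you want to pull just those validation fields out of the XML listing, a `sed` one-liner is enough for a quick look. This is a sketch, not a real XML parser: the name `list_patch_crcs` is invented, and it assumes each `<File ...>` tag sits on its own line with the attributes in the order dar printed them for me.

```shell
# Hypothetical helper: print "name patch_base_crc patch_result_crc" for every
# <File> tag that carries a non-empty patch_base_crc (that is, a delta patch).
# Assumes one <File ...> tag per line and the attribute order seen in this post.
list_patch_crcs() {
    sed -n 's/.*<File name="\([^"]*\)".* patch_base_crc="\([^"][^"]*\)" patch_result_crc="\([^"]*\)".*/\1 \2 \3/p'
}
```

It would be fed with `dar -l /tmp/patchfile -txml | list_patch_crcs`. Files stored in full have an empty `patch_base_crc` and are skipped.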
Now on the destination, we can apply the diff and see what it does:
dest$ dar --extract /tmp/patchfile \
    --fs-root /tmp/test \
    --no-warn
The `--no-warn` option tells dar not to warn before overwriting files, since we want to do just that anyhow.
Inspecting the result, it worked! It did the exact right thing.
Let's see if we can verify it works. I'll append 10 bytes to the end of a file and see what happens. It should be a smaller transfer.
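The exact command I used isn't shown here, but one way to append exactly 10 bytes and confirm the size change looks like this. It is a sketch on a throwaway file rather than the real `/tmp/test/fspl`; the function name `append_ten_bytes` and the scratch file are just for illustration.

```shell
# Sketch on a throwaway file; the actual command and target file are not shown in this post.
append_ten_bytes() {
    # $1 = file to grow by exactly 10 bytes
    printf '0123456789' >> "$1"
}

scratch=$(mktemp)
head -c 5000 /dev/zero > "$scratch"   # stand-in for an existing binary
before=$(wc -c < "$scratch")
append_ten_bytes "$scratch"
after=$(wc -c < "$scratch")
echo "grew by $((after - before)) bytes"
rm -f "$scratch"
```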
After creating the files as before:
source$ ls -lh /tmp/patchfile.1.dar
-rw-r--r-- 1 jgoerzen jgoerzen 3.7K Jun 25 22:06 /tmp/patchfile.1.dar
source$ dar -l /tmp/patchfile -txml | less
...
<File name="fspl" size="5 Mio" stored="477 o" crc="3e66df64" dirty="no" sparse="no" delta_sig="no" patch_base_crc="78199781" patch_result_crc="78199781">
Yep, that took just 477 bytes to represent the change to the 5.3MB file. The delta patch worked.
Now, what if I change the timestamp on the file on the destination? It should result in a very small change to bring the timestamp back in line -- and indeed it did.
These dar commands could very easily be scripted. They could run over your favorite asynchronous transport: anything from a USB stick to Filespooler[5], NNCP[6], or Syncthing[7].
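As a starting point for such a script, here is a sketch that wraps the three steps in shell functions. The function names and argument order are invented for this example; the dar options are exactly the ones used earlier in this post, and nothing here handles errors, encryption, or the transport itself.

```shell
#!/bin/sh
# Sketch: the three steps from this post as shell functions.
# Function names and argument order are made up; the dar options are the
# ones used in the examples above.

# Step 1, run on the destination: snapshot catalog with delta signatures.
dest_snapshot() {
    # $1 = tree to describe, $2 = catalog basename (dar appends .1.dar)
    dar --create "$2" --fs-root "$1" --delta sig --compression=zstd --ref +
}

# Step 2, run on the source: archive only what differs from the catalog.
source_patch() {
    # $1 = source tree, $2 = catalog basename copied from the destination,
    # $3 = patch basename to create
    dar --create "$3" --ref "$2" --fs-root "$1" --compression=zstd
}

# Step 3, run on the destination: apply the patch in place.
dest_apply() {
    # $1 = patch basename copied from the source, $2 = tree to update
    dar --extract "$1" --fs-root "$2" --no-warn
}
```

On the destination you would run `dest_snapshot /tmp/test /tmp/destmeta`, carry `destmeta.1.dar` to the source by whatever means, run `source_patch /tmp/test /tmp/destmeta /tmp/patchfile` there, carry `patchfile.1.dar` back, and finish with `dest_apply /tmp/patchfile /tmp/test`.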
Incidentally, you might observe that these commands are just a regular use of dar's incremental backup support. Indeed, that is true. While ordinarily you might save the state from the prior backup, there's nothing preventing you from saving the state of something else; dar is just a CLI and it will diff whatever you tell it to diff.
--------------------------------------------------------------------------------
dar is a Backup[9] and archiving tool. You can think of it as a more modern tar. It supports both streaming and random-access modes, supports correct incrementals (unlike GNU tar's incremental mode), Encryption[10], various forms of compression, and even integrated rdiff deltas.
(c) 2022-2024 John Goerzen