Introduction to git-annex (Port Of The Week)

NIL=> https://bsd.network/@solenepercent/106223481238781904 Comment on Mastodon

Introduction

Now that git-annex is available as a package on OpenBSD I can use it again. I've been relying on it a few years ago but it was really complicated for me to compile it and I gave up. Since I really missed it, I'm now back to it and I think it's time to share about this wonderful piece of software.

git-annex is meant to help you manage your data like you would manage books in a library, you have a database telling you where the books are and you can find them on the shelves, or at least you can know who borrowed the book. We are working with digital files that can be copied here so the analogy doesn't fully work, but you could want to put your data in an external hard drive but not everything, and you may want to have some data on multiples devices for safety reasons, git-annex automates this.

It works very well for files that are not changing much, I call them "static files", they are music, videos, pictures, documents. You don't really want to use git-annex with files you edit everyday, it doesn't work well because the process can be a bit tedious.

git-annex may not be easy to understand at first, I suggest you try locally to grasp its purpose.

git-annex official website

what git-annex is not

Cheat sheet

Let's create a cheat sheet first. Most git-annex commands have a dedicated man page, but can also provide a simpler help by using "git annex help somecommand".

Create the repository

The first step is to create a repository which is based on git, then we will tell git-annex to init it too.

mkdir ~/MyDataLibrary && cd ~/MyDataLibrary
git init
git annex init "my-computer"

Add a file

When you want to register a file in git annex, you need to use "git annex add" to add it and then "git commit" to make it permanent. The files are not stored in the git repository, it will only contains metadata.

git annex add Something
git commit -m "I added something"

Example:

$ echo "hello there" > hello
$ ls -l hello
-rw-r--r--  1 solene  wheel  12 May 12 18:38 hello
$ git annex add hello
add hello
ok
(recording state in git...)
$ ls -l hello
lrwxr-xr-x  1 solene  wheel  180 May 12 18:38 hello -> .git/annex/objects/qj/g5/SHA256E-s12--aadc1955c030f723e9d89ed9d486b4eef5b0d1c6945be0dd6b7b340d42928ec9/SHA256E-s12--aadc1955c030f723e9d89ed9d486b4eef5b0d1c6945be0dd6b7b340d42928ec9
$  git status hello
On branch master
Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
        new file:   hello

Make changes to a file

If you want to make changes to a file, you first need to "unlock" it in git-annex, which mean the symbolic link is replaced by the file itself and is no longer in read-only. Then, after your changes, you need to add it again to git-annex and commit your changes.

git annex unlock file
vi file
git annex add file
git commit -m "I changed something" file

Add a remote encrypted repository

If you want to store data (for duplication) on a remote server using ssh you can use a remote of type "rsync" and encrypt the data in many fashions (GPG with hybrid is the best). This will allow to store data on remote untrusted devices.

git annex initremote my-remote-server type=rsync rsyncurl=remote-server.com:/home/solene/git-annex-data keyid=my-gpg@address encryption=hybrid

After this command, I can send files to my-remote-server.

git-annex website about encryption

git-annex website about special remotes

Manage data from multiple computers (with ssh)

If you want to use a remote server through ssh, there are two ways: mounting the remote file system using sshfs or use a plain ssh. If you use sshfs, then it falls as a standard local file system like an external usb drive, but if you go through ssh, it's different.

You need to have a key authentication based for the remote ssh and you also need git-annex on the remote server. It's important to have a bare git repo.

cd /home/data/
git init --bare
git annex init "remote-server"

On your computer:

git remote add remote-server ssh://hostname:/home/data/
git fetch remote-server

You will be able to use commands related to repositories now!

List files and where they are stored

You can use the "git annex list" command to list where your files are physically stored.

In the following example you can see which files are on my computer and which are available on my remote server called "network", "web" and "bittorrent" are special remotes.

here
|network
||web
|||bittorrent
||||
X___ Documentation/Nim/Dominik Picheta - Nim in Action-Manning Publications (2017).pdf
X___ Documentation/ada/Ada-Distilled-24-January-2011-Ada-2005-Version.pdf
X___ Documentation/ada/courseada1.pdf
X___ Documentation/ada/courseada2.pdf
X___ Documentation/ada/courseada3.pdf
X___ Documentation/scheme/artanis.pdf
X___ Documentation/scheme/guix.pdf
X___ Documentation/scheme/manual_guix.pdf
X___ Documentation/skribilo/skribilo.pdf
X___ Documentation/uck2ep1.pdf
X___ Documentation/uck2ep2.pdf
X___ Documentation/usingckermit3e.pdf
XX__ Musique/Daft Punk/01 - Albums/1997 - Homework/01 - Daftendirekt.flac
XX__ Musique/Daft Punk/01 - Albums/1997 - Homework/02 - Wdpk 83.7 fm.flac
XX__ Musique/Daft Punk/01 - Albums/1997 - Homework/03 - Revolution 909.flac
XX__ Musique/Daft Punk/01 - Albums/1997 - Homework/04 - Da Funk.flac
XX__ Musique/Daft Punk/01 - Albums/1997 - Homework/05 - Phoenix.flac
_X__ Musique/Alan Walker/Alan Walker - Different World/01 - Alan Walker - Intro.flac
_X__ Musique/Alan Walker/Alan Walker - Different World/02 - Alan Walker, Sorana - Lost Control.flac
_X__ Musique/Alan Walker/Alan Walker - Different World/03 - Alan Walker, Julie Bergan - I Don_t Wanna Go.flac

List files locally available

If you want to list the files for which you have the content available locally, you can use the "list" command from git-annex but only restrict to the group "here" representing your local repository.

git annex list --in here

Work with a remote repository

Copy files to a remote

If you want to duplicate files between repositories to have multiples copies you can use "git annex copy".

git annex copy Music -t remote-server

Move files to a remote

If you want to move files from a repository to another (removing the content from origin) you can use "git annex move" which will copy to destination and remove from origin.

git annex move Music -t remote-server

Get a file content

If you don't have a file locally, you can fetch it from a remote to get the content.

git annex get Music/Queen

Forget a file locally

If you don't want to have the file locally because you don't have disk space or you simply don't want it, you can use the "drop" command. Note that "drop" is safe because git-annex won't allow you to drop files that have only one copy (except if you use --force of course).

git annex drop Music/Queen

Real life example: I have a very huge music library but my laptop SSD is too small, I get get some music I want and drop the files I don't want to listen for a while.

Use mincopies to enforce multi repository data duplication

The numcopies and mincopies variables can be used to tell git-annex you want exactly or at least "n" copies of the files, so it will be able to protect you from accidental deletions and also help uploading files to other repositories to match the requirements.

Enable per directory recursively

echo "* annex.mincopies=2" > .gitattributes

Only upload files not matching the num copies

If you have multiples repositories and some files doesn't match the copies requirements, you can use the following commands to only push the files missing copies.

git annex copy --auto -t remote-server

Real life example: I want my salaries PDF to be really safe, I can ask to have 2 copies of those and then run a sync to the remote server which will proceed to upload them if there is only one copy of the file yet.

Verifying integrity and requirements

There is the git-annex fsck command which will check the integrity of every file in the local repository and reports you if they are sane (or not), but it will also tell you which file doesn't meet the mincopies requirements.

git annex fsck

Reversibility

If for some reasons you want to give up git-annex, you can easily get all your files back like a normal file system by using "git annex unlock ." on the top directory of your repository, every local files will be replaced by their physical copy instead of the symlink. Reversibility is very important when you deal with your data because it means you are not stuck forever with a tool in case it's broken or if you want to switch to another process.

My workflow

I have a ~/DATA/ directory in which I have sub directories {documents,documentation,pictures,videos,music,images}, documents are papers or legal papers, documentation are mostly PDF. Pictures are family pictures and images are wallpapers or stupid images I want to keep.

I've set a mincopies to 2 for documents and pictures and my music is not on my computer but on a remote, I get the music files I want to listen when I'm on the local network with the computer having the files, I drop them locally when I'm bored.

Conclusion

git-annex separates content from indexation, it can be used in many ways but it implies an archivist philosophy: redundancy, safety, immutability (sort of). It is not meant for backup, you can backup your directory managed by git-annex, it will save the data you have locally, you will have to make backup of your other data as well.

I love that tool, it's a very nice piece of software. It's unique, I didn't find any other program to achieve this.

More resources

git-annex official walkthrough

git-annex special remotes (S3, webdav, bittorrent etc..)

git-annex encryption