Now that git-annex is available as a package on OpenBSD, I can use it again. I relied on it a few years ago, but it was really complicated for me to compile and I gave up. Since I really missed it, I'm now back to it, and I think it's time to write about this wonderful piece of software.
git-annex is meant to help you manage your data like you would manage books in a library: you have a database telling you where the books are, so you can find them on the shelves, or at least know who borrowed them. We are working with digital files that can be copied, so the analogy doesn't fully work, but you may want to put some of your data (not everything) on an external hard drive, and you may want to keep some data on multiple devices for safety reasons. git-annex automates this.
It works very well for files that don't change much. I call them "static files": music, videos, pictures, documents. You don't really want to use git-annex for files you edit every day, because the editing process is a bit tedious.
git-annex may not be easy to understand at first, so I suggest you try it locally to grasp its purpose.
Let's create a cheat sheet first. Most git-annex commands have a dedicated man page, and a shorter help is available with "git annex help somecommand".
The first step is to create a git repository, then tell git-annex to initialize it too.
mkdir ~/MyDataLibrary && cd ~/MyDataLibrary
git init
git annex init "my-computer"
When you want to register a file in git-annex, use "git annex add" to add it and then "git commit" to make it permanent. The file content is not stored in the git repository, which only contains metadata.
git annex add Something
git commit -m "I added something"
Example:
$ echo "hello there" > hello
$ ls -l hello
-rw-r--r-- 1 solene wheel 12 May 12 18:38 hello
$ git annex add hello
add hello ok
(recording state in git...)
$ ls -l hello
lrwxr-xr-x 1 solene wheel 180 May 12 18:38 hello -> .git/annex/objects/qj/g5/SHA256E-s12--aadc1955c030f723e9d89ed9d486b4eef5b0d1c6945be0dd6b7b340d42928ec9/SHA256E-s12--aadc1955c030f723e9d89ed9d486b4eef5b0d1c6945be0dd6b7b340d42928ec9
$ git status hello
On branch master
Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
        new file:   hello
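The symlink shown above is also a quick way to tell whether the content of a file is present: in a locked repository, an annexed file whose content is not available locally is a dangling symlink. Here is a minimal sketch, assuming the default locked (symlink) mode. The helper name annex_state is made up for this example and is not a git-annex command, and the demo uses fake symlinks so it runs without git-annex.

```shell
#!/bin/sh
# annex_state FILE: hypothetical helper, not part of git-annex.
# Prints "present" or "missing" for a symlinked (locked) file,
# "regular" for a plain file (for example after "git annex unlock"),
# and "absent" when nothing exists at that path.
annex_state() {
    if [ -L "$1" ]; then
        # -e follows the symlink, so it fails when the target is not there.
        if [ -e "$1" ]; then echo "present"; else echo "missing"; fi
    elif [ -e "$1" ]; then
        echo "regular"
    else
        echo "absent"
    fi
}

# Self-contained demo with fake symlinks, no git-annex needed.
demo=$(mktemp -d)
echo "hello there" > "$demo/object"
ln -s "$demo/object" "$demo/hello"          # content available
ln -s "$demo/no-such-object" "$demo/song"   # content not fetched
echo "hello: $(annex_state "$demo/hello")"
echo "song: $(annex_state "$demo/song")"
rm -rf "$demo"
```

Unlocked files are regular files, so this check only makes sense for locked ones.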
If you want to make changes to a file, you first need to "unlock" it in git-annex, which means the symbolic link is replaced by the file itself and the file is no longer read-only. After your changes, add it to git-annex again and commit.
git annex unlock file
vi file
git annex add file
git commit -m "I changed something" file
If you want to store data (for duplication) on a remote server over ssh, you can use a special remote of type "rsync" and pick one of several encryption schemes (GPG with "hybrid" is the best). This lets you store data on untrusted remote devices.
git annex initremote my-remote-server type=rsync rsyncurl=remote-server.com:/home/solene/git-annex-data keyid=my-gpg@address encryption=hybrid
After this command, I can send files to my-remote-server.
git-annex website about encryption
git-annex website about special remotes
If you want to use a remote server through ssh, there are two ways: mounting the remote file system with sshfs, or using plain ssh. With sshfs, the remote behaves like a standard local file system, such as an external USB drive, but over plain ssh it's different.
You need key-based authentication for ssh on the remote server, and git-annex must be installed there too. It's important to use a bare git repository on the server.
cd /home/data/
git init --bare
git annex init "remote-server"
On your computer:
git remote add remote-server ssh://hostname:/home/data/
git fetch remote-server
You will be able to use commands related to repositories now!
You can use the "git annex list" command to list where your files are physically stored.
In the following example, you can see which files are on my computer ("here") and which are available on my remote server called "network". "web" and "bittorrent" are special remotes.
here
|network
||web
|||bittorrent
||||
X___ Documentation/Nim/Dominik Picheta - Nim in Action-Manning Publications (2017).pdf
X___ Documentation/ada/Ada-Distilled-24-January-2011-Ada-2005-Version.pdf
X___ Documentation/ada/courseada1.pdf
X___ Documentation/ada/courseada2.pdf
X___ Documentation/ada/courseada3.pdf
X___ Documentation/scheme/artanis.pdf
X___ Documentation/scheme/guix.pdf
X___ Documentation/scheme/manual_guix.pdf
X___ Documentation/skribilo/skribilo.pdf
X___ Documentation/uck2ep1.pdf
X___ Documentation/uck2ep2.pdf
X___ Documentation/usingckermit3e.pdf
XX__ Musique/Daft Punk/01 - Albums/1997 - Homework/01 - Daftendirekt.flac
XX__ Musique/Daft Punk/01 - Albums/1997 - Homework/02 - Wdpk 83.7 fm.flac
XX__ Musique/Daft Punk/01 - Albums/1997 - Homework/03 - Revolution 909.flac
XX__ Musique/Daft Punk/01 - Albums/1997 - Homework/04 - Da Funk.flac
XX__ Musique/Daft Punk/01 - Albums/1997 - Homework/05 - Phoenix.flac
_X__ Musique/Alan Walker/Alan Walker - Different World/01 - Alan Walker - Intro.flac
_X__ Musique/Alan Walker/Alan Walker - Different World/02 - Alan Walker, Sorana - Lost Control.flac
_X__ Musique/Alan Walker/Alan Walker - Different World/03 - Alan Walker, Julie Bergan - I Don_t Wanna Go.flac
If you want to list only the files whose content is available locally, you can use the "list" command from git-annex restricted to "here", which represents your local repository.
git annex list --in here
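The flag columns printed by "git annex list" can also be filtered with standard tools. Here is a small sketch, assuming the output was saved to a file and has the four location columns of my example (here, network, web, bittorrent). It prints the files whose only copy is on "here". The helper name only_here and the shortened sample are made up for the illustration; git-annex's own matching options (--in, --not, --copies) can express similar queries without any parsing.

```shell
#!/bin/sh
# only_here FILE: print the paths whose content exists on "here" only.
# Assumes FILE holds "git annex list" output with exactly four location
# columns, the first one being "here". Hypothetical helper.
only_here() {
    # Keep the lines whose flags are exactly X___ and strip the flags.
    sed -n 's/^X___ //p' "$1"
}

# Demo on a shortened copy of the listing shown in this article.
list=$(mktemp)
cat > "$list" <<'EOF'
here
|network
||web
|||bittorrent
||||
X___ Documentation/scheme/guix.pdf
XX__ Musique/Daft Punk/01 - Albums/1997 - Homework/04 - Da Funk.flac
_X__ Musique/Alan Walker/Alan Walker - Different World/01 - Alan Walker - Intro.flac
EOF
only_here "$list"
rm -f "$list"
```

With a different number of remotes, the X___ pattern has to be adapted to the number of columns.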
If a repository is lost or you no longer use it, simply mark it as "dead".
git annex dead $repo_name
An rsync special remote can also be created with the "shared" encryption scheme:
git annex initremote $name type=rsync rsyncurl=remote-server:/home/solene/mydirectory keyid=your@email encryption=shared
If you want to duplicate files between repositories to have multiple copies, you can use "git annex copy".
git annex copy Music -t remote-server
If you want to move files from one repository to another, you can use "git annex move", which copies the content to the destination and then removes it from the origin.
git annex move Music -t remote-server
If you don't have a file locally, you can fetch it from a remote to get the content.
git annex get Music/Queen
If you don't want to keep a file locally, because you lack disk space or simply don't need it, you can use the "drop" command. Note that "drop" is safe because git-annex won't let you drop the only copy of a file (unless you use --force, of course).
git annex drop Music/Queen
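The "get" and "drop" commands above can be tried without any server, using two throwaway repositories on the same machine. Here is a sketch under stated assumptions: git and git-annex are installed (the function skips itself otherwise), and the names library, laptop and try_get_and_drop are invented for the example. It reports whether the content is there after each step, using the fact that a locked file without content is a dangling symlink.

```shell
#!/bin/sh
# Try "get" and "drop" between two temporary repositories.
# All names below are made up for this example.
try_get_and_drop() {
    if ! command -v git-annex >/dev/null 2>&1; then
        echo "git-annex is not installed, skipping"
        return 0
    fi
    work=$(mktemp -d) || return 1

    # "library" holds the content of the file.
    git init -q "$work/library" || return 1
    (
        cd "$work/library" || exit 1
        git config user.name "demo"
        git config user.email "demo@example.invalid"
        git annex init "library" >/dev/null 2>&1
        echo "hello there" > hello
        git annex add hello >/dev/null 2>&1
        git commit -q -m "I added something"
    ) || return 1

    # "laptop" is a clone: it knows the file exists but has no content yet.
    git clone -q "$work/library" "$work/laptop" || return 1
    (
        cd "$work/laptop" || exit 1
        git config user.name "demo"
        git config user.email "demo@example.invalid"
        git annex init "laptop" >/dev/null 2>&1
        [ -e hello ] || echo "after clone: content missing"
        git annex get hello >/dev/null 2>&1
        [ -e hello ] && echo "after get: content present"
        git annex drop hello >/dev/null 2>&1
        [ -e hello ] || echo "after drop: content missing"
    )
    status=$?

    # git-annex write-protects its objects, so restore permissions first.
    chmod -R u+w "$work"
    rm -rf "$work"
    return $status
}

try_get_and_drop
```

The drop is accepted here because git-annex can check that "library" still has a copy.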
Real life example: I have a huge music library but my laptop SSD is too small, so I get the music I want and drop the files I won't listen to for a while.
The numcopies and mincopies settings tell git-annex how many copies of each file you want: numcopies is the number of copies it tries to maintain, and mincopies is the minimum it must always keep. This protects you from accidental deletions and also helps upload files to other repositories to meet the requirements.
echo "* annex.mincopies=2" > .gitattributes
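The "*" pattern applies the rule to every file. Since .gitattributes uses regular git patterns, the requirement can also be limited to some directories. Here is a sketch, assuming a repository with documents/ and pictures/ directories like mine. The function name add_mincopies_rules is made up, and it appends with ">>" instead of overwriting the file.

```shell
#!/bin/sh
# Ask for at least 2 copies of everything under documents/ and pictures/.
# The directory names are assumptions taken from the layout described in
# this article, adapt them to your own repository.
add_mincopies_rules() {
    # $1 is the top directory of the repository (defaults to ".").
    attrs="${1:-.}/.gitattributes"
    for dir in documents pictures; do
        rule="$dir/** annex.mincopies=2"
        # Append the rule only if it is not already in the file.
        grep -qxF "$rule" "$attrs" 2>/dev/null || echo "$rule" >> "$attrs"
    done
}

# Demo in a temporary directory instead of a real repository.
demo=$(mktemp -d)
add_mincopies_rules "$demo"
cat "$demo/.gitattributes"
rm -rf "$demo"
```

In a real repository, .gitattributes is a normal file tracked by git, so it has to be added and committed like any other change.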
If you have multiple repositories and some files don't meet the copies requirement, you can use the following command to push only the files that are missing copies.
git annex copy --auto -t remote-server
Real life example: I want my salary PDFs to be really safe, so I ask for 2 copies of them and then run the command above against the remote server, which uploads only the files that still have a single copy.
The "git annex fsck" command checks the integrity of every file in the local repository and reports whether they are sane (or not). It also tells you which files don't meet the mincopies requirement.
git annex fsck
If for some reason you want to give up git-annex, you can easily get all your files back as a normal file system by running "git annex unlock ." in the top directory of your repository: every local file will be replaced by its physical copy instead of the symlink. Reversibility is very important when you deal with your data, because it means you are not stuck forever with a tool if it breaks or if you want to switch to another process.
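Before walking away from git-annex, it can be reassuring to check that no symlink is left in the work tree. Here is a minimal sketch using only find. The helper name remaining_symlinks is made up, and the demo runs on a fake directory rather than a real repository.

```shell
#!/bin/sh
# remaining_symlinks DIR: list the symlinks left under DIR, skipping .git.
# Hypothetical helper: an empty output means every file is a regular file.
remaining_symlinks() {
    find "$1" -name .git -prune -o -type l -print
}

# Demo: one regular file and one leftover symlink.
demo=$(mktemp -d)
mkdir "$demo/.git"
echo "hello there" > "$demo/hello"
ln -s "$demo/hello" "$demo/still-a-link"
ln -s "$demo/hello" "$demo/.git/ignored-link"
remaining_symlinks "$demo"
rm -rf "$demo"
```

Any path it prints is still a symlink and would need another look before the repository is treated as a plain directory.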
I have a ~/DATA/ directory with the subdirectories {documents,documentation,pictures,videos,music,images}. Documents are papers or legal papers, and documentation is mostly PDF files. Pictures are family pictures, and images are wallpapers or stupid images I want to keep.
I've set mincopies to 2 for documents and pictures. My music is not on my computer but on a remote: I get the music files I want to listen to when I'm on the same local network as the computer holding them, and I drop them locally when I'm bored.
git-annex separates content from indexation. It can be used in many ways, but it implies an archivist philosophy: redundancy, safety, immutability (sort of). It is not meant for backup: you can back up a directory managed by git-annex, but that only saves the data you have locally, so you still have to back up your other data as well.
I love that tool, it's a very nice piece of software. It's unique, and I haven't found any other program that does this.
git-annex official walkthrough