
Auto Syncing

I have a remote server that acts as sort of a DMZ between my friends and my local server. I recently obtained two 16TB drives that I set up in my local server to act as a NAS. The only issue is - my remote server has limited space and lacks a decent enough directory structure to just rely on basic rsyncing. So I wrote a script that checks whether any new files exist on the remote and downloads them.

ssync

I called the script ssync, and I set it up to just run on a */1 cron.

[https] ssync (git)
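The cron entry itself is nothing fancy - a minimal sketch, assuming the script lives in $HOME/bin and logs somewhere sensible (both paths are placeholders, not what the real setup uses):

*/1 * * * * $HOME/bin/ssync >> $HOME/.ssync/cron.log 2>&1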

... # variable setup and prechecks

log "Fetching files"
mkdir -p $RUN_DIR
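# list everything on the remote modified since the last run, as paths relative to REMOTE_DIR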
ssh -i $KEY_FILE $REMOTE \
  "find ${REMOTE_DIR} -newermt ${PREV_RUN_DATE} -exec realpath --relative-to ${REMOTE_DIR} {} \;" \
  >> $CURGET_FILE
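# keep only paths not already recorded as fetched (comm needs sorted input)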
comm -23 <(sort -u $CURGET_FILE) <(sort -u $FETCHED_FILE) > $FETCH_FILE
COUNT=$(wc -l $FETCH_FILE | cut -d' ' -f1)

if [ $COUNT -gt 0 ]; then
  # Syncing
  log "Found ${COUNT} files to fetch"

  cat $FETCH_FILE >> $FETCHED_FILE
  log "Wrote files to fetched files"
  log "Syncing now"
  cat $FETCH_FILE | xargs -n1 -P$PARALLEL -I '{}' rsync -e "ssh -i $KEY_FILE" \
    -av \
    $REMOTE:${REMOTE_DIR}/'{}' ${SRC_DIR}
else
  log "No files to sync"
fi
echo $NEXT_RUN_DATE > $LASTRAN_FILE
log "Done syncing"

The script relies on 5 main utilities:

* ssh
* find
* comm
* xargs
* rsync

I use ssh to connect to the remote server, find all the files that have been created since the last run, and append that list to a file on my local machine. Then I compare that list against everything I've fetched so far to see if anything has already been fetched (or is currently being fetched) - comm -23 keeps only the lines unique to the first input, i.e. the files I still need to grab.

With the new list of files to grab (the output of comm), I feed them through xargs into rsync with a parallelism of 5 (about what my bandwidth can manage if they're large files).

Once the sync is complete I update the last-run date with the start time of the script, to be used on the next run. This means that if a sync takes 15 minutes, each minute we will still be looking for files using the date of the last successful run. I did this as a way to ensure we're not missing anything. I like to write my scripts with a margin of error - I'd rather have a duplicate on my local machine than lose something. And because the fetched file contains a list of everything fetched - or in the process of being fetched - even though the ssh listing will pull those files again, comm weeds them out.
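The variable setup is elided from the excerpt above, but the date bookkeeping amounts to something like this - a sketch, where the date format is an assumption and only PREV_RUN_DATE, NEXT_RUN_DATE and LASTRAN_FILE come from the actual script:

NEXT_RUN_DATE=$(date +%Y-%m-%dT%H:%M)  # start time of this run
PREV_RUN_DATE=$(cat "$LASTRAN_FILE")   # start time of the last completed run
# ... find / comm / rsync as above ...
echo $NEXT_RUN_DATE > $LASTRAN_FILE    # only written once the sync finishes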

Everything goes into an inbox directory that I sort through periodically, cataloging the files into their proper directories.

Joy

It is very nice to no longer have to think "did I sync this directory yet?". Since I have a limited amount of space on the remote server, it's good to know I can just delete anything more than a few days old without worry.

I've had this remote server for 8 years and I'd never gotten around to setting this up. So it's a huge relief to finally have it.

Pain

This entire process would've been SO MUCH simpler if I had just made the two directory trees match in structure. I could have simply run rsync against the two main outer directories and called it a day.
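Something like this single command would have covered it - a sketch that reuses the variable names from the script above, not what I actually run:

rsync -av -e "ssh -i $KEY_FILE" $REMOTE:${REMOTE_DIR}/ ${SRC_DIR}/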

find

I use xargs and comm all the time. But find is one of those utilities where, honestly, I never knew it had an -exec option. It has been a huge life saver in getting everything I had missed off of this remote server.

I wanted to find any files that didn't already exist on my local machine, so on both servers I ran:

find . -type f -exec basename {} \; | sort > index.txt

This created a list of all of the files on each machine. Then I could just comm -23 the two indexes to get the ones I still needed to go fetch.
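Roughly like this - a sketch, assuming the remote index had been copied over next to the local one as remote-index.txt (the file names, apart from remote-only.txt which the next command reads, are just placeholders):

comm -23 remote-index.txt index.txt > remote-only.txt

I could rely on the names not conflicting, so I just did: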

cat remote-only.txt | xargs -I{} find . -iname '*{}' -exec realpath {} \;

And now I knew exactly where the files I was missing were, and I could figure out the best way to fetch them all.

I spend my weekends shell scripting

I don't know about most of you - but I honestly could not compute without access to the shell. So much of what I do is simplified because I can write a line of commands that execute the actions I want as a single process (through pipelining) and then make a few adjustments and run it again for a new set of data.
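For example, when the inbox starts piling up, a throwaway pipeline like this tells me what I'm about to sort before I start - the inbox path and the idea of grouping by extension are just an illustration, not part of ssync:

find ~/inbox -type f -exec basename {} \; | awk -F. 'NF>1 {print $NF}' | sort | uniq -c | sort -rn

A few seconds later I know roughly what kinds of files are waiting, and I can tweak the same line to answer the next question.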

[youtube] AT&T Archives: The UNIX Operating System (ts=5:41)

I recommend you watch the entire video if you're a fan of unix-style operating systems - but around 6 minutes in, Brian Kernighan explains pipelining and the power it provides. I use this video as a reference all the time when asked what I love about my Linux machine, and why I wouldn't want to go back to Windows full time. He breaks it down in such a clear and precise way that I find it useful for explaining to non-technical people (or anyone unfamiliar with unix or the command line) why and how the command line can be so powerful.

Time to write

This was a perfect candidate for a script because syncing files over the internet takes time - so spending a few hours perfecting this script ultimately saves me time and creates peace of mind. But writing a script that looks at a file once it appears on the local machine and tries to discern where to move it is not worth the time. A simple mv takes seconds at most - but parsing the file name, maybe looking at some metadata, and trying to guess which directory it belongs in has far too many requirements to even list. A human taking a few minutes on the weekend to just create some new dirs and move the files into them is the point where scripting isn't worth it (yet).

Conclusion

I know I go into shell scripting in my reply about the shell not being good for automation - and I wouldn't be surprised if I've gushed over it in a few other gemlogs. There is something about the terminal that just clicks with how my brain wants to manage the PC, and I wouldn't want to use a computer without it.

Links

Gemlog

Home