Application and infrastructure code is often the result of months or years of combined effort from a team, costing a large amount of money to create. It makes sense to keep a backup of this digital asset, in case of accidental (or malicious) loss.
Years ago, IT teams would take responsibility for backing up the on-premises SVN/SourceSafe/Mercurial/Git servers to tape, and organise shipping the tapes off-site on a daily basis.
These days, I'm using Github or other hosted SaaS platforms to store code, but that doesn't absolve me of the responsibility to take backups, since no service is perfect.
For example, in 2017, Gitlab lost some customer data in a widely publicised incident [1].
There are 3rd party backup services that offer code backup, and if you're able to use them, these are probably the best route. However, in some cases, procurement may not be possible, or there might not be a solution in place. That's how I ended up writing my own backup script.
The script uses the Github CLI [2] to list up to 1000 repositories, then uses `xargs` to execute a `gh repo clone` command for each of the repos.
Once that's done, the script uses the AWS CLI to upload the content to S3.
#!/bin/bash
# https://gist.github.com/mohanpedala/1e2ff5661761d3abd0385e8223e16425
set -euxo pipefail

echo "Logging in with personal access token."
export GH_TOKEN=$BACKUP_GITHUB_PAT
gh auth setup-git

echo "Downloading repositories for" $BACKUP_GITHUB_OWNER
gh repo list $BACKUP_GITHUB_OWNER --json "name" --limit 1000 --template '{{range .}}{{ .name }}{{"\n"}}{{end}}' | xargs -L1 -I {} gh repo clone $BACKUP_GITHUB_OWNER/{}

echo "Downloaded repositories..."
find . -maxdepth 1 -type d

echo "Uploading to S3 bucket" $BACKUP_BUCKET_NAME "in region" $BACKUP_AWS_REGION
aws s3 sync --region=$BACKUP_AWS_REGION . s3://$BACKUP_BUCKET_NAME/github.com/$BACKUP_GITHUB_OWNER/`date "+%Y-%m-%d"`/

echo "Complete."
To give it access to download all of the repositories, the script uses a Github Personal Access Token [3].
The token needs to be given permissions to read from all repositories.
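Before wiring the token into automation, it's worth checking that it actually has the access it needs. A quick sanity check from a local shell might look like this (a sketch, assuming the token and organisation name are already exported in the same BACKUP_GITHUB_PAT and BACKUP_GITHUB_OWNER variables the script uses):

# Authenticate the Github CLI with the personal access token.
export GH_TOKEN=$BACKUP_GITHUB_PAT

# Confirm that the token is valid and see which scopes it has.
gh auth status

# List the repositories the token can see - private repositories should
# appear here if the token has been granted the right permissions.
gh repo list $BACKUP_GITHUB_OWNER --limit 1000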
To write to S3, the script requires the AWS CLI to be configured with access. This can be done using lots of techniques, including setting various AWS environment variables. The best way to provide access is to create an IAM role in AWS (not a user) that has write access to your backup bucket, and to allow the machine or human user that's running the script to "assume the role".
Using a role instead of an IAM user with static credentials avoids using the same AWS credentials for months or years, and simplifies administration.
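For a human or machine user running the script outside of Github Actions, assuming the role from a shell looks roughly like this (a sketch; the account ID and role name are placeholders, and Github Actions handles this step for you, as shown later):

# Assume the backup role and capture temporary credentials.
creds=$(aws sts assume-role \
  --role-arn arn:aws:iam::123456789012:role/github-backup \
  --role-session-name github-backup \
  --query 'Credentials.[AccessKeyId,SecretAccessKey,SessionToken]' \
  --output text)

# Export the temporary credentials so that the AWS CLI picks them up.
export AWS_ACCESS_KEY_ID=$(echo "$creds" | cut -f1)
export AWS_SECRET_ACCESS_KEY=$(echo "$creds" | cut -f2)
export AWS_SESSION_TOKEN=$(echo "$creds" | cut -f3)

# Check which identity is now in use.
aws sts get-caller-identity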
Your backup S3 bucket should be configured according to the latest best practices. At the time of writing, that would look something like this:
const backupBucket = new s3.Bucket(this, "backupBucket", {
  blockPublicAccess: s3.BlockPublicAccess.BLOCK_ALL,
  enforceSSL: true,
  versioned: true,
  encryption: s3.BucketEncryption.S3_MANAGED,
  intelligentTieringConfigurations: [
    {
      name: "archive",
      archiveAccessTierTime: Duration.days(90),
      deepArchiveAccessTierTime: Duration.days(180),
    },
  ],
})
Since we're using Github already, the easiest way to run some code every day is to use Github Actions [4].
Github Actions can run a YAML-based workflow triggered by changes to git repositories, as you might expect, but it can also run code on a schedule.
name: Backup
on:
  workflow_dispatch:
  schedule:
    - cron: '0 0 * * *'
To give the script the AWS permissions it needs, I use a Github Action that assumes a role inside AWS (more on this shortly). On the Github Actions side, the `id-token: write` permission needs to be granted to enable Github Actions to log in to AWS.
permissions:
  id-token: write
  contents: read
I've created a Docker container which has all of the dependencies of the script (AWS CLI, Github CLI) pre-installed, along with the script itself, and shipped it as a public image in Github's container registry (`ghcr.io`).
The Github Actions workflow can then be configured to run inside that container.
jobs:
  Backup:
    runs-on: ubuntu-latest
    container: ghcr.io/a-h/githubbackup:main
    name: Backup
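As an aside, the same container can be used to run a backup from a local machine, which is handy for testing. A rough sketch, assuming AWS credentials and the BACKUP_* variables are already exported in the shell, and that the backup-organisation-code script is on the image's PATH (which is how the workflow below invokes it):

# Run the backup container locally, passing through the same environment
# variables that the workflow provides via secrets.
docker run --rm \
  -e BACKUP_GITHUB_PAT \
  -e BACKUP_GITHUB_OWNER \
  -e BACKUP_AWS_REGION \
  -e BACKUP_BUCKET_NAME \
  -e AWS_ACCESS_KEY_ID \
  -e AWS_SECRET_ACCESS_KEY \
  -e AWS_SESSION_TOKEN \
  ghcr.io/a-h/githubbackup:main \
  backup-organisation-code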
The next step is to assume the AWS role that has permissions to write to the backup S3 bucket, using the `configure-aws-credentials` Github Action [5].
The IAM role needs to be configured to enable it to be assumed by Github. An example is in the documentation [6]:
steps:
  - name: Assume role
    uses: aws-actions/configure-aws-credentials@v1
    with:
      role-to-assume: ${{ secrets.BACKUP_AWS_ROLE }}
      aws-region: ${{ secrets.BACKUP_AWS_REGION }}
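On the AWS side, the role's trust policy needs to allow Github's OIDC provider to assume it. A rough sketch using the AWS CLI, assuming the Github OIDC provider has already been created in the account, and using a placeholder account ID and organisation name (the exact conditions to use are covered in the documentation [6]):

# Trust policy allowing Github Actions workflows in the organisation to
# assume the role via OIDC. The account ID and organisation are placeholders.
cat > trust-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::123456789012:oidc-provider/token.actions.githubusercontent.com"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
        },
        "StringLike": {
          "token.actions.githubusercontent.com:sub": "repo:my-org/*"
        }
      }
    }
  ]
}
EOF

aws iam create-role \
  --role-name github-backup \
  --assume-role-policy-document file://trust-policy.json

# The role also needs write access to the backup bucket, for example an
# inline policy granting s3:PutObject on the bucket, attached with
# aws iam put-role-policy.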
Once the role is assumed, I think it's a good idea to print the assumed identity to the logs, so you can check that it worked.
- name: Display assumed role
  run: aws sts get-caller-identity
Finally, the `backup-organisation-code` script can be run. Note the use of Github Secrets to store the parameters.
- name: Backup
  shell: bash
  env:
    BACKUP_GITHUB_PAT: ${{ secrets.BACKUP_GITHUB_PAT }}
    BACKUP_GITHUB_OWNER: ${{ secrets.BACKUP_GITHUB_OWNER }}
    BACKUP_AWS_REGION: ${{ secrets.BACKUP_AWS_REGION }}
    BACKUP_BUCKET_NAME: ${{ secrets.BACKUP_BUCKET_NAME }}
  run: backup-organisation-code
You might be surprised to see a Github Personal Access Token in the list. By default, Github Actions only has access to read from the current repository, so the personal access token is used to grant the backup script read access to all of the repos in the organisation.
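The secrets themselves can be set in the repository settings, or with the Github CLI. A sketch, run from within the repository that hosts the workflow, with placeholder values:

# Store the parameters as Actions secrets in the current repository.
gh secret set BACKUP_GITHUB_PAT # prompts for the token value
gh secret set BACKUP_GITHUB_OWNER --body "my-org"
gh secret set BACKUP_AWS_REGION --body "eu-west-1"
gh secret set BACKUP_BUCKET_NAME --body "my-backup-bucket"
gh secret set BACKUP_AWS_ROLE --body "arn:aws:iam::123456789012:role/github-backup"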
It's fairly straightforward to back up your Github account to AWS, and Github Actions sends automated email alerts on workflow failures, so you'll know when a backup has failed.
It probably took me a day to set this up, but that's less time than it would have taken to procure a 3rd party service and deal with the security audits.
All of the code and an example are available at [7].