Sketches for a digital lab assistant

Posted on 2022-03-10

Of all the infuriating things about working in neuroimaging, one of the most frustrating is that there isn't (afaik) a great way to version control Jupyter notebooks. We're often running many very similar "experiments" and end up with many copies of basically the same notebook, but with minor changes in processing steps or parameters. Since most people in this space (especially on the more medical/translational side) don't have programming backgrounds, notebooks usually end up following what I'll call here Word Doc Versioning -- i.e. you keep appending the word "final" to the end of your filename ad infinitum. In a more realistic example, for one of my projects, I went from "corticalpercentage" to "permute_corticalpercentage" to "permute_corticalpercentage_seizuredrop".

Much of this comes down to the fact that notebooks are supposed to be used for prototyping only, and code should be moved into backend modules when it matures. However, this is not the way it works in many labs, since many people (especially MDs) do not have the skills to set up IDEs and dev environments (and do not have the time to learn). Notebooks are an easy way for anyone with a limited computing background to run and make minor modifications to research code. The barrier to entry is further lowered in my lab, as the notebooks are all hosted on a central JupyterHub node run by the institution's research computing support team. This is all to say that I realize that a notebook-based workflow is not ideal for an experienced dev, but there's not really a better option for my situation.

Even though I do have an education in software and have had it drilled into me to use version control tools, I still often end up with Word Doc Versioning as well. Some of this is definitely due to laziness, but the existing tools are also often painful to use. This feels like it should be a solved problem, but it's not.

Existing solutions

Here are some solutions and reasons why they're not satisfactory.

Vanilla git

This seems like an obvious solution. After all, notebooks are just source code with fancier comments, right?

Problems:

Since Jupyter notebooks are essentially huge JSON files mapping the structure of their cells, diffs are illegible. (In this respect, R Markdown notebooks are superior to Jupyter notebooks)
Notebooks can contain rich content like images which can quickly blow up repo size if you're not careful
Versioning results by versioning notebooks is problematic, as notebooks may rely on exogenous data which is liable to change. If the data dependencies change, either (1) you have to re-run and commit every time a dependency changes to reflect the new output, or (2) your git controlled notebook will have outdated results. The only way to avoid this is to make separate copies of notebooks which read from different versions of input data -- this is basically just the same as Word Doc Versioning, but under git control (which is undeniably better, but not ideal)
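
To make the first problem concrete, here is a minimal sketch (not a recommendation of any particular tool) of what a legible notebook diff could look like: it pulls only the code cells out of the notebook JSON and diffs those, ignoring outputs and metadata. It uses only the Python standard library. The helper names (`code_cells`, `notebook_diff`, `make_notebook`) and the toy notebook contents are made up for illustration.

```python
import difflib
import json


def code_cells(notebook_json: str) -> list[str]:
    """Return the source lines of every code cell, ignoring outputs and metadata."""
    nb = json.loads(notebook_json)
    lines = []
    for i, cell in enumerate(nb.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", [])
        # The notebook format stores "source" as a string or a list of strings.
        text = source if isinstance(source, str) else "".join(source)
        lines.append(f"# --- cell {i} ---")
        lines.extend(text.splitlines())
    return lines


def notebook_diff(old_json: str, new_json: str) -> str:
    """Unified diff of the code in two notebooks, given their raw JSON text."""
    return "\n".join(
        difflib.unified_diff(
            code_cells(old_json),
            code_cells(new_json),
            fromfile="old",
            tofile="new",
            lineterm="",
        )
    )


def make_notebook(threshold: float) -> str:
    """Build a tiny hypothetical notebook so the example is self-contained."""
    return json.dumps({
        "nbformat": 4,
        "nbformat_minor": 5,
        "metadata": {},
        "cells": [{
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "source": ["import numpy as np\n", f"threshold = {threshold}\n"],
            "outputs": [{"output_type": "stream", "name": "stdout",
                         "text": ["lots of output\n"]}],
        }],
    })


if __name__ == "__main__":
    # Only the changed parameter shows up; the stored output does not.
    print(notebook_diff(make_notebook(0.5), make_notebook(0.7)))
```

This only addresses legibility. It does nothing about repo size or about results going stale when the input data changes.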

Datalad/DVC

There are a number of composite data versioning and workflow management tools such as Datalad (developed specifically for the neuroimaging community) and DVC (developed mostly for machine learning tasks). Both layer on top of git to version control input and output data (Datalad via git-annex, DVC via its own cache and remote storage), and both have their own workflow mechanisms in the spirit of Snakemake and Nextflow.

Problems:

Notebooks are second class citizens, even though they're widely used in both communities. The only way to incorporate notebooks into defined workflows is by using something like Papermill, which gets very messy very quickly.
The way to run multiple versions of an experiment is to make new branches. While this sounds good in theory, it is actually quite a pain when you want to do something like compare the results of two different analyses. You would need to copy results from one branch into a folder outside of the git repo, then check out the other branch and repeat. Also, you won't be able to run two versions of the analysis at the same time, which is necessary when a single job might take hours or days to complete.
I simply don't want a wrapper on top of git! Don't make me type `datalad save`, I just want to `git commit`!
These monolithic systems are the opposite of the Unix philosophy. This means that if you are trying to do something outside of what the maintainers anticipated (which _should_ happen regularly if you're doing interesting research), they become systems you have to actively fight against.
Even though filenames no longer have to follow Word Doc Versioning, the Word Doc Versioning just gets pushed onto branches. Now instead of final_analysis_final_final.ipynb, I just have a branch called final_analysis_final_final.

Towards a digital lab assistant

I think the main problem with the solutions above is that, while they do indeed enhance reproducibility and rigour, they do not decrease (and sometimes increase) cognitive load. An ideal notebook management system for me would feature:

Notebooks as the assumed default way of executing code
A persistent TUI/REPL interface
Tracking tags/descriptions for each notebook (so that filenames can be kept as short unique identifiers)
Tracking input/output data WITHOUT being a full-blown workflow engine
Branching notebooks while keeping a flat file hierarchy (so you can view and run multiple experiments at once)
Merging branches together to create comparisons/insights
A system for tracking/standardizing language and variable names throughout all notebooks. I'm thinking less of a templating system, and more like a big dictionary of terms that you can copy into your clipboard when writing code.
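
As a rough illustration of how a few of these wishes could fit together (descriptions and tags kept out of filenames, input tracking without a workflow engine, and branching in a flat hierarchy), here is a minimal sketch. Everything in it is hypothetical: the `NotebookRecord` and `Registry` names, the single JSON registry file, and the choice of SHA-256 digests are assumptions made for the example, not a description of an existing tool. Branching here only records metadata; copying the notebook file itself is left out.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Optional


def file_digest(path: Path) -> str:
    """SHA-256 of a file's contents, used to notice when an input has changed."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


@dataclass
class NotebookRecord:
    notebook_id: str                  # short unique identifier (the filename stem)
    description: str                  # what would otherwise be crammed into the filename
    tags: list[str] = field(default_factory=list)
    parent: Optional[str] = None      # id of the notebook this one was branched from
    inputs: dict[str, str] = field(default_factory=dict)  # input path -> digest at last run


class Registry:
    """Hypothetical flat-file registry: one JSON file sitting next to the notebooks."""

    def __init__(self, path: Path):
        self.path = path
        self.records: dict[str, NotebookRecord] = {}
        if path.exists():
            raw = json.loads(path.read_text())
            self.records = {k: NotebookRecord(**v) for k, v in raw.items()}

    def save(self) -> None:
        data = {k: asdict(v) for k, v in self.records.items()}
        self.path.write_text(json.dumps(data, indent=2))

    def add(self, record: NotebookRecord) -> None:
        self.records[record.notebook_id] = record

    def branch(self, parent_id: str, new_id: str, description: str) -> NotebookRecord:
        """Register a sibling notebook that remembers where it came from."""
        parent = self.records[parent_id]
        child = NotebookRecord(new_id, description, tags=list(parent.tags),
                               parent=parent_id, inputs=dict(parent.inputs))
        self.add(child)
        return child

    def record_inputs(self, notebook_id: str, paths: list[Path]) -> None:
        """Store the current digest of each input file for this notebook."""
        self.records[notebook_id].inputs = {str(p): file_digest(p) for p in paths}

    def stale_inputs(self, notebook_id: str) -> list[str]:
        """Inputs that are missing or no longer match the recorded digest."""
        stale = []
        for path, digest in self.records[notebook_id].inputs.items():
            p = Path(path)
            if not p.exists() or file_digest(p) != digest:
                stale.append(path)
        return stale
```

With something like this, "corticalpercentage" and "permute_corticalpercentage" could live side by side as, say, `cp01.ipynb` and `cp02.ipynb`, with the second recorded as a branch of the first and the long names moved into the description field. Calling `stale_inputs` would flag a notebook whose results predate a change in its data.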

I'll probably try to implement something like this in the next few weeks, so stay tuned.

Thoughts? Suggestions for existing solutions? Reach out by email:

mailto:nomadpenguin@protonmail.com
