💾 Archived View for nomadpengu.in › thoughts › notebooks captured on 2023-07-22 at 16:26:02. Gemini links have been rewritten to link to archived content
Posted on 2022-03-10
Of all the infuriating things about working in neuroimaging, one of the most frustrating is that there isn't (afaik) a great way to version control Jupyter notebooks. We're often running many very similar "experiments" and end up with many copies of basically the same notebook, with only minor changes in processing steps or parameters. Since most people in this space (especially on the more medical/translational side) don't have programming backgrounds, notebooks usually end up following what I'll call here Word Doc Versioning -- i.e. you keep appending the word "final" to the end of your filename ad infinitum. For a more realistic example, in one of my projects I went from "corticalpercentage" to "permute_corticalpercentage" to "permute_corticalpercentage_seizuredrop".
Much of this comes down to the fact that notebooks are supposed to be used for prototyping only, and code should be moved into backend modules when it matures. However, this is not the way it works in many labs, since many people (especially MDs) do not have the skills to set up IDEs and dev environments (and do not have the time to learn). Notebooks are an easy way for people with limited computing backgrounds to run and make minor modifications to research code. The barrier to entry is further lowered in my lab, as the notebooks are all hosted on a central JupyterHub node run by the institution's research computing support team. This is all to say that I realize that a notebook-based workflow is not ideal for an experienced dev, but there's not really a better option for my situation.
Even though I do have an education in software and have had it drilled into me to use version control tools, I still often end up with Word Doc Versioning as well. Some of this is definitely due to laziness, but the existing tools are also often painful to use. This feels like it should be a solved problem, but it's not.
Here are some solutions and reasons why they're not satisfactory.
This seems like an obvious solution; after all, notebooks are just source code with fancier comments, right?
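One wrinkle with the "just source code" view: an .ipynb file is really a JSON document that stores outputs and execution counts alongside the code, so re-running a cell changes the file even when the code is identical. Below is a minimal sketch of stripping those fields before saving or committing, assuming the nbformat v4 layout. The `strip_outputs` helper and the example notebook are made up for illustration; this is not any particular tool's implementation.

```python
import json


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of a notebook dict with outputs and execution counts cleared."""
    # Round-tripping through JSON is a cheap deep copy for JSON-only data.
    cleaned = json.loads(json.dumps(notebook))
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


# Tiny hand-written notebook (hypothetical content, nbformat v4 layout assumed).
example = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Cortical percentage"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": 7,
            "source": ["print(1 + 1)"],
            "outputs": [{"output_type": "stream", "name": "stdout", "text": ["2\n"]}],
        },
    ],
}

cleaned = strip_outputs(example)
# Sorted keys and fixed indentation keep the serialized form stable between runs.
print(json.dumps(cleaned, indent=1, sort_keys=True))
```

The original dict is left untouched, so the same helper could be applied to a file loaded with `json.load` and written back out under a different name.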
Problems:
There are a number of composite data versioning and workflow management tools such as Datalad (developed specifically for the neuroimaging community) and DVC (developed mostly for machine learning tasks). These build on git (Datalad on git plus git-annex, DVC on git plus its own file cache) to version control input and output data, and have custom workflow engine systems similar to Snakemake and Nextflow.
Problems:
I think the main problem with the solutions above is that, while they do indeed enhance reproducibility and rigour, they do not decrease (and sometimes increase) cognitive load. An ideal notebook management system for me would feature:
I'll probably try to implement something like this in the next few weeks, so stay tuned.
Thoughts? Suggestions for existing solutions? Reach out by email: