Mini-guide to reproducible Python code
A lot of modern research requires custom software, whether to run calculations, analyse experimental data or do something else. Good-quality, sustainable software is always desirable, but ticking all the boxes that are often described as necessary to achieve it can be a daunting task for researchers, who often have other priorities in mind.
Reproducibility, however, is not an optional feature of a piece of research, whether it involves software or not, and researchers are fully responsible for addressing it. Luckily, of the many requirements of good-quality, sustainable software, only a handful are necessary to support the reproducibility of results, and those few go a long way.
In this post we describe the essential steps that researchers should take to support the reproducibility of their software. The recommendations are for software developed using Python. They might not apply to all cases, and they are not foolproof, as reproducibility is a really complex business, but they are a good start and will reduce the chances of things going wrong when other people try to use the software.
Starting with a script or notebook
Very often, the starting point of some Python code used in research is a plain script (e.g. analysis.py) or a Jupyter Notebook (e.g. analysis.ipynb). In there, a few dependencies might be loaded, as well as some input data; some analysis is done and then, finally, a few plots or some processed data files are created. At this stage, your software directory might contain little more than that script or notebook.
Providing this with your paper is not enough to make it reproducible.
What versions of the dependencies are you using? What operating system are you working on? What version of the input data? And what version of your code? Plenty of things can differ between your computer and the one where someone else runs the tool, so let's try to narrow down the uncertainty a bit.
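One low-effort way to start answering these questions is to let the code report its own environment. The following is a minimal sketch, not a complete solution: it uses only the standard library, and the function names describe_environment and sha256_of_file are our own choices for illustration. It collects the Python version and operating system, and computes a checksum that can serve as a fingerprint for a given version of an input data file.

```python
import hashlib
import platform
import sys
from pathlib import Path


def describe_environment() -> dict:
    """Collect basic facts about the interpreter and the operating system."""
    return {
        "python_version": platform.python_version(),
        "implementation": platform.python_implementation(),
        "operating_system": platform.platform(),
        "executable": sys.executable,
    }


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large data files do not fill the memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
    # For an input file you would call, for example:
    #     sha256_of_file(Path("path/to/input/data/file"))
```

Saving this output next to your results gives readers something concrete to compare against when their numbers differ from yours.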
Adding a README and dependencies
The first step is to indicate which dependencies you are using and which versions of them. This is key because different versions of a dependency might produce different results or simply be incompatible. For example, your code might need pandas, matplotlib and numpy, but only work with numpy versions older than version 2. You must communicate this information so others know what they need to have installed.
Moreover, what version of Python are you using? Version 3.9 is very different from version 3.13. And on what operating system? It is not uncommon for code to work well on one platform, e.g. Linux, but not on another, or to run but produce different results.
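In addition to documenting these constraints, a script can check them when it starts. The sketch below is only illustrative: the function names and the thresholds (Python 3.12 or newer, numpy older than version 2) are assumptions chosen to match the examples in this post, and the version parsing is deliberately naive.

```python
import sys
from importlib import metadata


def major_version(version_string: str) -> int:
    """Extract the leading major number from a version such as '1.26.4'."""
    return int(version_string.split(".")[0])


def check_environment(min_python=(3, 12), numpy_max_major=1) -> list:
    """Return a list of human-readable problems; an empty list means no problem found."""
    problems = []
    if sys.version_info[:2] < min_python:
        found = ".".join(str(part) for part in sys.version_info[:2])
        wanted = ".".join(str(part) for part in min_python)
        problems.append(f"Python {wanted} or newer expected, found {found}")
    try:
        numpy_version = metadata.version("numpy")
    except metadata.PackageNotFoundError:
        problems.append("numpy is not installed")
    else:
        if major_version(numpy_version) > numpy_max_major:
            problems.append(
                f"numpy {numpy_max_major}.x or older expected, found {numpy_version}"
            )
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print(f"WARNING: {problem}")
```

A check like this does not replace the dependency file and README, but it turns a silent mismatch into a visible warning.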
To address these two issues, we need to add two extra files.
The first one is a file indicating the dependencies and versions you are using. Common choices are requirements.txt files and environment.yml files. The first is used with pip while the second is used with conda, but both serve the same purpose: to define which other tools, typically Python packages, your software depends on.
Note
You can get the full list of packages installed in your current environment with commands like pip freeze or conda env export, but keep in mind that this will include ALL packages in the environment, which might be more than your software actually needs if you do not use virtual environments or were not careful when installing the dependencies.
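If your environment contains more than your software needs, an alternative is to pin only the packages your code imports directly. The following is a minimal sketch using the standard library's importlib.metadata; the function name pin_requirements is hypothetical, and the package list is simply the example used earlier in this post. Keep in mind that the name used to install a package can differ from the name used to import it, so the list should contain the names you would pass to pip.

```python
from importlib import metadata


def pin_requirements(package_names) -> list:
    """Return 'name==version' lines for the given packages.

    Packages that are not installed are reported as comment lines
    instead of being silently dropped.
    """
    lines = []
    for name in package_names:
        try:
            lines.append(f"{name}=={metadata.version(name)}")
        except metadata.PackageNotFoundError:
            lines.append(f"# {name}: not installed in this environment")
    return lines


if __name__ == "__main__":
    # Example list only; edit it to match what your own code imports.
    direct_dependencies = ["pandas", "matplotlib", "numpy"]
    print("\n".join(pin_requirements(direct_dependencies)))
```

The printed lines can be pasted into requirements.txt. Unlike pip freeze, this does not pin indirect dependencies, so it trades some exactness for a shorter, more readable file.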
The second is a README.md file. In simple terms, this is just a short text file providing details on the operating system and Python version, installation instructions - i.e. how to install the dependencies - and usage instructions.
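As an example, your software directory could now contain the analysis script plus the two new files, requirements.txt and README.md.
requirements.txt would list pandas, matplotlib and numpy, one per line, each with its version constraint, and the contents of README.md could be: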
# Data analysis software
This software has been tested using:
- Windows 11 Enterprise
- Python 3.12.4
To install the dependencies, open PowerShell and run:
`python -m pip install -r requirements.txt`
Then, to run the software, run the following in the same terminal:
`python analysis.py path/to/input/data/folder`
Plenty of other things could be included, such as how to create a virtual environment, the format of the input data or instructions for other operating systems, but you are at least providing details on how to run the tool on a system like the one you used.
However, including the above files is still not enough to make your software reproducible: which version of your software should other people use in order to reproduce your results?
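The usage line in the README above assumes that analysis.py accepts the input folder as a command-line argument. A minimal sketch of how that could be wired up with the standard library's argparse follows; the function names and the body of the analysis (which only lists the files it finds) are placeholders, not the actual analysis code.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse the command line; argv=None means 'use sys.argv'."""
    parser = argparse.ArgumentParser(
        description="Run the data analysis on a folder of input files."
    )
    parser.add_argument(
        "input_folder", type=Path, help="folder containing the input data"
    )
    return parser.parse_args(argv)


def main(argv=None) -> int:
    """Return 0 on success and 1 if the input folder does not exist."""
    args = parse_args(argv)
    if not args.input_folder.is_dir():
        print(f"Input folder not found: {args.input_folder}")
        return 1
    # Placeholder for the real analysis: here we only list the files found.
    for path in sorted(args.input_folder.iterdir()):
        print(path.name)
    return 0


# At the bottom of analysis.py you would typically add:
#     if __name__ == "__main__":
#         raise SystemExit(main())
```

Failing early with a clear message when the folder is missing saves other people from guessing whether the problem is in their setup or in your code.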
Using version control
Version control is the technique that enables you to track changes in your code, giving you a unique identifier for each snapshot of your code whenever you commit changes. A common tool for this is git, which is a command line tool but also has many graphical user interfaces for those less keen on the terminal.
It is beside the point of this mini-post to explain how to do version control, so we will just refer to our Introductory course on Git and GitHub. For the purposes of reproducibility, understanding the processes and tools described in this course is more than enough.
If using git, the only change to the software directory will be the creation of a hidden folder within it, .git, which is created automatically when you initialise the repository. It contains the entire history of your version-controlled directory, but you do not need to worry about it: you will never need to modify the files there manually, only via git commands such as git commit.
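Once the code is under version control, the commit identifier can be recorded together with the results, so that outputs can be traced back to the exact snapshot that produced them. The following is a sketch that assumes git is installed and available on the PATH; the function names are our own, and both functions return None rather than failing when run outside a repository.

```python
import subprocess


def _run_git(arguments, repo_dir="."):
    """Run a git command and return its output, or None if it fails."""
    try:
        result = subprocess.run(
            ["git", *arguments],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # git is not installed, or repo_dir is not inside a git repository.
        return None
    return result.stdout


def current_commit(repo_dir="."):
    """Return the full commit hash of HEAD, or None if it cannot be determined."""
    output = _run_git(["rev-parse", "HEAD"], repo_dir)
    return None if output is None else output.strip()


def has_uncommitted_changes(repo_dir="."):
    """Return True if the working tree has modified or untracked files."""
    output = _run_git(["status", "--porcelain"], repo_dir)
    return None if output is None else bool(output.strip())


if __name__ == "__main__":
    print(f"commit: {current_commit()}")
    print(f"uncommitted changes: {has_uncommitted_changes()}")
```

If there are uncommitted changes, the commit hash alone does not describe the code that actually ran, so it is worth committing before producing results you intend to publish.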
With version control enabled for your software directory, we can start talking about reproducibility. It is not infallible - it never will be - but it is good enough for most purposes with minimal effort.
Publish in an online repository
This is all good, but your version-controlled, properly described software is still on your computer: how do you share it with others, including people reading your papers who want to reproduce your results?
The answer is putting your software into an online repository.
Online code repositories like GitHub, GitLab or Bitbucket enable you to share your code with others. This might be with yourself on another computer, with a restricted set of collaborators or, in the case of public repositories, with the whole world. These repositories contain your software and all its version-controlled history, which means you can easily tell other people which specific version of the software they need to use in order to reproduce your results. Normally, they have features to tag especially relevant snapshots of your code - e.g. by creating releases - making it easier to find the right versions with human-readable names (e.g. v1.0, rather than j245er..., which is a commit hash).
The same introductory course mentioned above includes instructions on how to use GitHub, so we will not provide more details here.
Note
If you want to be a bit more thorough, you should archive the relevant versions of your code in a service like Zenodo or Figshare. This gives you a Digital Object Identifier (DOI) that uniquely identifies the version, and keeps your code "forever". After all, software history in GitHub can be altered and past versions deleted, so this sort of archiving is a more permanent solution for supporting reproducibility.
Conclusions
There is a lot that can be done to improve a piece of software and make it sustainable and reproducible, but getting started and picking the low-hanging fruit is very accessible, takes very little time and can go a long way to support your research. You can always refine things later, adding more tooling and creating a more complex structure for your code. But this is a good place to begin.