Schedule

10:00 - Welcome and Introduction
10:20 - Hacking
12:15 - Catch up before Lunch
13:15 - Hacking
15:45 - Wrap up
16:00 - End of session

Some self-evident truths

(once you’ve worked in research for a while)

Reproducibility counts
Reproducibility doesn’t happen by itself
Open science and reproducibility go hand in hand
You’ll thank yourself later
Technology is driving change and providing opportunities

Organisations that care

Defining our terms

Taken from https://the-turing-way.netlify.app/reproducible-research/overview/overview-definitions.html

Reproducibility is just the beginning, but it underpins the goal of generalisable findings within a domain.

An overview of technologies

A diagram showing different aspects of reproducibility

Taken from https://nbis-reproducible-research.readthedocs.io/en/latest/tutorial_intro/

About you

Resources

The Turing Way is a great and very comprehensive initial resource for dealing with all aspects of reproducibility.
See our Python project template for a quick start on new projects with best practice for reproducibility baked in.
Our Graduate School Course on Essential Software Engineering for Researchers gives an introduction to several reproducibility topics including writing tests, managing environments and how to structure code.

Git

The Graduate School course developed by the Imperial RSE team. Good for the basics before moving onto look at use of GitHub for publishing your code and facilitating team work. https://imperialcollegelondon.github.io/grad_school_git_course/
https://learngitbranching.js.org provides a fun and interactive set of challenges for learning advanced features of Git to support working flexibly.
A variety of courses pitched at different levels are available on LinkedIn Learning https://www.linkedin.com/learning/topics/git
The relevant section of The Turing Way.
An enormous amount of material exists, see Awesome Git for an overview.

Environment Management

This is tricky one to cover in details as doing it correctly tends to vary on a language-by-language basis. Conda is the recommended approach here due to its flexibility in covering multiple languages as well as virtual environments.

If you’re not used to using virtual environments with Conda this towards data science article does a good job of explaining their value. It’s also a good initial introduction to using environment.yml files as the most robust way to manage your environments. This is Python focused but all its insights generalise.
The relevant section of The Turing Way

Containers

Reproducible Computational Environments Using Containers provides a good introductory guide to both Docker and Singularity with comparisons of both. The course was developed with researchers in mind and includes content on using containers in research workflows.
Ten simple rules for writing Dockerfiles for reproducible data science gives a very valuable set of pointers for using Docker in the context of data analysis. A must read for creating portable analyses.
Once you’ve gotten the hang of Docker basics these best practice tips will help you write clean and efficient Dockerfiles.
A variety of courses for Docker are available on LinkedIn Learning. https://www.linkedin.com/learning/search?keywords=docker
The relevant section of The Turing Way

Jupyter Notebooks/R Markdown

As it sounds R Markdown: The Definitive Guide is a pretty definitive guide with some handy examples built in.
This blog post on RMarkdown Driven Development is a very interesting take on how to go from rough and ready data exploration to a polished document that cleanly tells a consistent narrative. Very relevant to Jupyter Notebooks as well.
Jupyter Notebooks are extremely popular and fairly painless to start using. Checkout the quick start guide to get up and running.
Once you have some Notebooks take the next step to make them usable online by anybody with a single click via binder.

Workflow Management

For basic problems make can be a good tool if you’re already familiar with it. Software carpentry have a good lesson on this.
Snakemake has an ok set of tutorials for getting started, though you’ll appreciate them best with a knowledge of bioinformatics. If you’re working on Windows I’d probably ignore the advice to use vagrant and use the Windows Subsystem for Linux instead.
Nextflow should only be considered if you aren’t familiar with Python. I struggled to find a tutorial that looked like it would work easily. Try the documentation for an introduction and breakdown of a simple example.
If you’re interested in something with a user friendly graphical interface you could look at Galaxy. This takes a very different approach based on having a permanent server available to run workflows.

Testing and Continuous Integration

We cover testing in of research software in detail in our Graduate School course Essential Software Engineering for Researchers
The Turing Way has a good section on strategies for testing scientific codes.
If you’re going to have tests its good practice to make sure that they are run as part of your code hosting with continuous integration. We have an experimental guide under development on this topic.

RSE Reproducibility Hackathon 02/12/2020

A hackathon event focusing on reproducibility in computational research. Led by the Imperial College Research Software Engineering team.