Introducing containers

Overview

Teaching: 30 min
Exercises: 10 min
Questions
  • What are containers, and why might they be useful to me?

Objectives
  • Show how software depending on other software leads to configuration management problems.

  • Explain the notion of virtualisation in computing.

  • Explain the ways in which virtualisation may be useful.

  • Explain how containers streamline virtualisation.

Disclaimers

This lesson resembles the Carpentries’ lesson format because it uses their stylesheet (which is openly available for such purposes). However, the similarity is purely visual: the lesson is not being developed by the Carpentries and has no association with them.

Some parts of this course are in the early stages of development. Comments or ideas on how to improve the material are more than welcome!

Welcome, all

Introduce yourself and think about the following topics:

  • What research area you are involved in
  • Why you have come on this course and what you hope to learn
  • One thing about you that may surprise people

Write a few sentences on these in the course Etherpad.

The fundamental problem: software has dependencies that are difficult to manage

Consider Python, a widely used programming language for analysis. Many Python users install the language and tools using a distribution such as Anaconda and give little thought to the underlying software dependencies that allow them to use libraries such as Matplotlib or Pandas on their computer. Indeed, in an ideal world none of us would need to think about fixing software dependencies, but we are far from that world. For example:

All of the above discussion is just about Python. Many people use many different tools and pieces of software during their research workflow, all of which may have dependency issues. Some software may depend only on the version of the operating system you’re running. Other software is more like Python: the language itself changes over time, and it depends on an enormous set of software libraries written by unrelated software development teams.
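To make these usually hidden dependencies visible, the following minimal sketch reports the Python version, the operating system, and the installed versions of a few named packages. The helper name `describe_environment` is hypothetical and exists only for illustration; the package names are the libraries mentioned above and may not be installed on your machine.

```python
import platform
import sys
from importlib import metadata


def describe_environment(packages):
    """Return a dict describing the interpreter, OS and package versions.

    Hypothetical helper for illustration only. A package that is not
    installed is reported as None rather than raising an error.
    """
    report = {
        "python": "{}.{}.{}".format(*sys.version_info[:3]),
        "os": platform.system(),
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report


# The output depends entirely on the machine this runs on,
# which is exactly the problem being described.
print(describe_environment(["matplotlib", "pandas"]))
```

Running this on two different computers will often give two different answers, even when both are "just running Python".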

What if you wanted to distribute a software tool that automated interaction between R and Python? Both of these language environments have independent version and software dependency lineages. As the number of software components such as R and Python increases, this can rapidly lead to a combinatorial explosion in the number of possible configurations, only some of which will work as intended. This situation is sometimes informally termed “dependency hell”.
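The combinatorial explosion can be illustrated with a short calculation. This is a minimal sketch: the component names and version lists are made up for illustration and are not a statement about which versions actually work together.

```python
from itertools import product

# Hypothetical version lists, chosen only for illustration.
components = {
    "python": ["3.8", "3.9", "3.10"],
    "r": ["4.0", "4.1"],
    "os": ["ubuntu-20.04", "ubuntu-22.04", "macos-12"],
}


def count_configurations(components):
    """Return the number of distinct version combinations."""
    total = 1
    for versions in components.values():
        total *= len(versions)
    return total


def list_configurations(components):
    """Return every combination as a dict mapping component to version."""
    names = list(components)
    return [dict(zip(names, combo)) for combo in product(*components.values())]


print(count_configurations(components))  # 3 * 2 * 3 = 18
```

Each additional component multiplies the total: adding a fourth component with four candidate versions would give 72 configurations to test.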

The situation is often mitigated in part by factors such as:

Although we have highlighted the dependency issue above, there are other, related problems that multiple versions of tools and software can cause:

Thankfully there are ways to get underneath (a lot of) this mess: containers to the rescue! Containers provide a way to package up software dependencies and access to resources such as data in a uniform and portable manner that allows them to be shared and reused across many different computer resources.

Background: virtualisation in computing

When running software on a computer, feeding the same input to the same computer should produce the same output, i.e. the result should be reproducible.

However, a computer, which we will call the “guest”, can itself be simulated in software running on another computer, which we will call the “host”. The guest computer is then said to have been virtualised: it is no longer a physical computer. A virtualised computer of this kind is known as a “virtual machine”, frequently abbreviated to “VM”.

We have avoided the software dependency issue by virtualising the lowest common factor across all software systems, which is the computer itself, beneath even the operating system software.

Omitting details and avoiding complexities…

Note that this description omits many details and avoids discussing complexities that are not particularly relevant to this introduction session, for example:

  • Thinking with analogy to movies such as Inception, The Matrix, etc., can the guest computer figure out that it’s not actually a physical computer, and that it’s running as software inside a host physical computer? Yes, it probably can… but let’s not go there within this episode.
  • Can you run a host as a guest itself within another host? Sometimes… but, again, we will not go there during this course.

Motivation for virtualisation

What features does virtualisation offer?

Types of virtualisation:

Downsides of virtualisation:

Containers are a type of lightweight virtualisation

Containers are similar to full, hardware-level virtual machines but offer a more lightweight solution. As highlighted above, containers sacrifice the strong isolation that full virtualisation provides in order to vastly reduce the resource requirements on the virtualisation host.

[Figure: VM vs Container]

The term “container” can be usefully considered with reference to shipping containers. Before shipping containers were developed, packing and unpacking cargo ships was time-consuming and error-prone, with a high potential for different clients’ goods to become mixed up. Similar to VMs, software containers standardise the packaging of a complete software system (the lightweight virtual machine): you can drop a container into a container host, and it should “just work”.

On this course, we will be using Linux containers - all of the containers we will meet are based on the Linux operating system in one form or another. However, the same Linux containers we create can run on:

We should certainly see people using the same containers on macOS and Windows today.

And what do you do?

Think about your work. How does computing help you do your research? How do you think containers (or virtualisation) could help you do more or better research?

Write a few sentences on this topic in the course Etherpad in the section about yourself that you created earlier.

Containers and file systems

One complication with using a virtual environment such as a container (or a VM) is that the file systems (i.e. the directories that the container sees) can now potentially come from two different locations:

This is illustrated in the diagram below:

Host system:                                                      Container:
------------                                                      ----------
/                                                                 /
├── bin                                                           ├── bin                <-- Overrides host version
├── etc                                                           ├── etc                <-- Overrides host version
├── home                                                          ├── usr                <-- Overrides host version
│   └── auser/data ───> mapped to /data in container ──> ─┐       ├── sbin               <-- Overrides host version
├── usr                                                   │       ├── var                <-- Overrides host version
├── sbin                                                  └───────├── data
└── ...                                                           └── ...

Although there are many use cases for containers that do not require mapping host directories into the container, a lot of real-world use cases for containers in research do use this feature and we will see it in action throughout this lesson.
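The directory mapping in the diagram can be sketched as a simple path translation. This is only a model of the idea, not how any container platform implements it: the function name `to_container_path` is hypothetical, and the paths are the ones used in the diagram.

```python
from pathlib import PurePosixPath


def to_container_path(host_path, host_dir, container_dir):
    """Translate a host path into the path seen inside the container.

    Hypothetical helper for illustration only. Returns None when
    host_path lies outside the mapped host directory, since this
    mapping does not make such files available in the container.
    """
    host_path = PurePosixPath(host_path)
    try:
        relative = host_path.relative_to(PurePosixPath(host_dir))
    except ValueError:
        return None
    return str(PurePosixPath(container_dir) / relative)


# /home/auser/data on the host is mapped to /data in the container.
print(to_container_path("/home/auser/data/results.csv", "/home/auser/data", "/data"))
# /data/results.csv
print(to_container_path("/home/auser/notes.txt", "/home/auser/data", "/data"))
# None
```

The second call returns None because `/home/auser/notes.txt` sits outside the mapped directory, so in this model the container has no path that refers to it.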

Docker and Singularity

Docker is software that manages containers and the resources that containers need. While Docker is a leader in the container space, there are many similar technologies available, and the concepts we learn today will allow us to use other container platforms even if their command syntax is a little different.

The second part of this course will introduce a different container platform, Singularity. Singularity is widely available on shared high-performance computing systems, where Docker’s design makes it unsuitable for the multi-user nature of these systems.

Docker’s terminology

Before we start the first part of the course, we will introduce some of the technical terms used by Docker:

Key Points

  • Almost all software depends on other software components to function, but these components have independent evolutionary paths.

  • Projects involving many software components can rapidly run into a combinatorial explosion in the number of software version configurations available, yet only a subset of possible configurations actually works as desired.

  • Virtualisation is an approach that can collect software components together and can help avoid software dependency problems.

  • Containers are a popular, widely used type of lightweight virtualisation.

  • Docker and Singularity are two different software platforms that can create and manage containers and the resources they use.