Docker as a platform for reproducibility
Last month I released an early stage of a project called JiffyLab, aimed at making it easy to set up a teaching lab environment. In the process I became a bit more familiar with Docker, which provides a key piece of JiffyLab's underlying tech, namely a set of improvements that make Linux containers easier to use.
I've been paying more attention to the discussion around open science and how the tooling can be improved. One of the great sources for this increasingly active area is Titus Brown's blog.
A recent guest post there brings up some of the challenges of getting a replicable environment for science. To simplify things a bit, the quest is for something like the benefit of a git commit hash, but for the complete experimental stack, so that a given experimental computation can be reproduced with the same code, the same data, and all the same environmental conditions. In an earlier post Titus makes the point that a black-box snapshot of a VM is not sufficient, as it doesn't help people build on the work and methods. Anyone who has worked on fostering open software development knows that documenting the setup of a full development environment takes far more work than listing the system requirements to simply run the software. The basic idea is that it is far more valuable to be able to extend and remix someone's efforts than to just replay the computational tape and verify that the outcome is in fact the same. As an aside, I'd argue that unless you know HOW the machinery that reproduces the result works, you haven't actually verified that the result is being reproduced in any meaningful way.
So how does Docker fit into this? Docker works by stacking a set of read-only filesystem layers into a standardized process-running environment, with a writable 'container' layer on top (see this explanation with graphics). Any changes made in the container can be captured as another image layer and committed as a new image or tag.
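As a rough sketch of that run/commit cycle (the image and tag names here are made up, and the exact flags can vary between Docker versions):

    # start a container from a base image and make a change inside it
    docker run -i -t ubuntu /bin/bash -c "apt-get install -y python-numpy"

    # find the id of the container that just exited
    docker ps -a

    # commit that container's changes as a new read-only image with a tag
    docker commit <container-id> myuser/analysis-env:numpy

    # anyone with that image can now start from exactly the same state
    docker run -i -t myuser/analysis-env:numpy /bin/bash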
How are these images different from VM snapshots?
First, a Docker image always remains read-only, which makes preserving a known state the default. A VM is more the inverse: its filesystem is read/write by default, and you have to explicitly take a snapshot to capture a known state.
Docker provides a DSL for building images that is roughly similar in purpose to Chef/Puppet, but it is simpler and more focused on the runtime environment than on the entire machine. The Docker project is also, for the time being, providing an index and server where people can upload and share images (namespaced under a login name). The docker build DSL contains a "FROM" command, which specifies the starting image/tag to build on. This provides a rudimentary infrastructure for remixing and adopting other people's work. For example I could have a docker build file that might look like:
    FROM titus/diginorm
    ADD my_alternate_dataset /data/
    RUN /bin/key_pipeline_step.py

and so on.
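Building and sharing that would then be something like (the image name is made up, and flags may differ between Docker versions):

    # build an image from the directory containing the build file
    docker build -t myuser/my-pipeline .

    # push it to the public index so others can pull it or build FROM it
    docker push myuser/my-pipeline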
Each step in a computational flow can be stored as a lightweight layer/tag in a final Docker container, and the process can be restarted or branched from any point along that pipeline. Unlike VMs, these containers are also lightweight in disk space, startup time, and runtime resources.
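So if an intermediate step has been committed as a tag (again, the names below are hypothetical), a collaborator could resume the pipeline from exactly that point, or branch off from it with a different dataset:

    # re-run a later stage of the pipeline from a committed intermediate image
    docker run -i -t myuser/my-pipeline:step2 /bin/key_pipeline_step.py

    # or poke around interactively, branching from that same known state
    docker run -i -t myuser/my-pipeline:step2 /bin/bash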
Wakari has already captured some of this, in the form of being able to create and save custom Wakari environments, but Docker provides more of a raw, low-level tool. This lower-level functionality cuts both ways. On one hand, it is open and completely accessible. I think the folks behind Wakari have only the best of intentions and it is a brilliant project in this domain, but as someone who believes in things being open all the way down, it is hard to accept that at some point I hit a black box. On the other hand, many scientists won't want to mess too much with the technology stack; they will feel that making their efforts truly remixable isn't worth the extra effort. In the end I think the technical challenges are secondary to the cultural ones.
In much of the Python world we have mostly-ish been able to accept virtualenv as a standard. In science there is a more fractured set of sub-cultures, with a greater diversity of needs, backgrounds, existing workflows, and legacy conventions. Docker would need some wrapping and/or UI layers to make it friendlier and easier to use for the common cases, but I'm not sure that, even if you could completely solve the technical and ease-of-use issues, you could ever reach a high enough level of adoption for it to become anything like a standard or convention.