I am an unabashed fan of distributed version control systems (DVCS) such as Mercurial or Git. And from time to time, I get drawn into discussions with friends and colleagues about the pros and cons of these.
One claim in particular comes up time and time again: that DVCS are complete overkill for the lone coder and, by extension, for the lone scientist. Here are some thoughts on version control for computational scientists, working alone or collaboratively.
In summary, I think one can get at this from several angles: (a) management of
change and a state of mind, (b) the reproducibility of the computational
experiment, and (c) showcasing yourself as a researcher/hacker and novel
measures of impact. These all overlap to a certain extent.
(a) Management of Change and a State of Mind
This is the obvious one: version control systems manage change. That is a
trivial thing to state, but they do it in very different ways.
Version control systems such as CVS or Subversion are in essence
feudalistic models of working: a central server holds a canonical version of an
artifact (source code, an ontology, a piece of writing), which gets pushed to
clients. Because of the feudalism, “commit” equals “inflict”: someone commits a
change to an artifact and it gets inflicted on all the clients working with the
same repository. So what are, or were, the consequences of this? Few small,
self-contained (“atomic”) commits (I realize that the discussion as to whether
such commits are a good thing, or whether even broken code should get checked
in, is one for an evening in the pub), and code that hardly ever got checked in.
Contrast this with distributed version
control systems. Here, there is a staging system. Code exists in the repository
on your machine and you develop on this. Code may also exist on another machine/server/host
such as Bitbucket or GitHub, which may or may not hold a canonical version. In
any case, the commit into your local repository is not coupled to a push: it
takes a separate “push” operation to add the code you have just committed
locally to the remote repository. Hence “commit” is not the same as “inflict”.
Typically, the result is more frequent commits, at
least locally, and better preservation of work. And this makes sense even for the
lone developer.
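The decoupling of commit and push can be sketched with a toy model. This is only an illustration, not how Git or Mercurial are implemented: the `Repository` class, its methods and the repository names are all hypothetical, and the remote history is assumed to be a prefix of the local one.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Repository:
    """A toy repository: just an ordered list of commit messages."""

    name: str
    history: List[str] = field(default_factory=list)

    def commit(self, message: str) -> None:
        # A commit only ever touches this repository's own history.
        self.history.append(message)

    def push(self, remote: "Repository") -> int:
        # Publishing is a separate, deliberate step: copy over whatever the
        # remote does not have yet and report how many commits were sent.
        # Simplifying assumption: the remote history is a prefix of ours.
        missing = self.history[len(remote.history):]
        remote.history.extend(missing)
        return len(missing)


local = Repository("laptop")          # hypothetical local repository
shared = Repository("shared-server")  # hypothetical remote repository

local.commit("baseline analysis script")
local.commit("half-finished experiment")
print(len(local.history), len(shared.history))  # 2 0

pushed = local.push(shared)
print(pushed, shared.history == local.history)  # 2 True
```

Until `push` is called, nothing is inflicted on anybody else: the lone developer can commit as often as they like and decide later what, if anything, to publish.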
Another aspect in this discussion concerns
the way in which changes are tracked by these systems: Subversion and others of
a similar ilk track versions – changes in the file system – whereas Git and
Mercurial track what has actually been changed. Again, an almost trivial
statement to make, but it has huge implications. Merging becomes much, much
easier that way, resulting in more branching, more commits, more
experiments. That’s a good thing, particularly for a scientist. Much work in
computational and data science involves parameter sweeps: running the same
protocol again and again, but with altered parameters each time.
Developing workflows and computational procedures quite often requires
experimentation: starting from a baseline script, branching, making changes,
merging these back, branching again, experimenting and so on. The commit and
branching mechanisms in version control systems can be used to track and
document these experiments: it is a step towards reproducible computational
science.
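One way to tie a parameter sweep to the version of the code that produced it is to record the current commit id alongside every run. The following is a minimal sketch under stated assumptions: `run_protocol` and its parameters `temperature` and `steps` are hypothetical stand-ins for a real computation, and the commit id falls back to "unknown" when the script is not run inside a Git repository.

```python
import json
import subprocess
from itertools import product


def current_commit() -> str:
    """Return the current Git commit id, or "unknown" if it cannot be read."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True,
            text=True,
            check=True,
        )
        return result.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        # Git is not installed, or we are not inside a repository.
        return "unknown"


def run_protocol(temperature: float, steps: int) -> float:
    # Hypothetical stand-in for the real computation.
    return temperature * steps


def sweep(temperatures, step_counts):
    """Run the protocol for every parameter combination and log each run."""
    commit = current_commit()
    records = []
    for temperature, steps in product(temperatures, step_counts):
        records.append(
            {
                "commit": commit,
                "temperature": temperature,
                "steps": steps,
                "result": run_protocol(temperature, steps),
            }
        )
    return records


records = sweep([1.0, 2.0], [10, 100])
print(json.dumps(records, indent=2))
```

Each record then states which parameters were used and which committed version of the script produced the result, so a run can in principle be repeated by checking out that commit.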
It is also a state of mind: the staging
involved in push and pull mechanisms, in addition to a commit, enables
distributed and therefore massively collaborative working. And sooner or later
even the loneliest of lonely scientists will have to engage in this way of
working, if he or she wants the world to acknowledge and take up the work that
has been done. This way of working is so powerful that software development
tools such as Git and Mercurial are now also used to author legislation (http://www.quora.com/Ari-Hershowitz/Posts/Hackathon-Anyone-Recode-Californias-Laws),
to distribute information (German law, for example, is available on GitHub) and even for purposes further afield (http://www.wired.com/wiredenterprise/2013/01/this-old-house/).
Bottom-up, massively collaborative ways of working are the ways of working of
the future, and distributed version control systems are one embodiment of this
mindset.
(b) The Reproducibility of the Computational Experiment
This picks up the discussion begun in
the previous point. By combining version control systems with
conventions for organizing the other components of, for example,
bioinformatics projects, we might be able to tackle issues of reproducibility
of computational experiments/investigations a bit better. There has been some
discussion around this on Biostars and also in the literature, most notably in
a paper by Noble about organizing computational biology projects and in our Lensfield paper from a while back.
Version control here fulfills three functions: (a) backup, (b) the keeping of a historical
record of work done and (c) enabling concurrent work by multiple collaborators,
which may sooner or later happen to even the loneliest of scientists.
(c) Showcasing yourself as a hacker/developer/bioinformatician/scientist/whatever
Apart from the possibility of working
massively collaboratively, a whole social ecosystem has sprung up around these
tools. There’s the obvious: GitHub is
integrated with professional social networks such as LinkedIn and serious job
websites such as Stack Overflow Careers. These integrations give hackers and
scientists the opportunity to showcase themselves in completely new ways. Ask
yourself: if you were an employer looking for a new bioinformatician/scientist/hacker
or developer, and you had to choose between (a) an applicant who sends you the
standard cover letter/CV combo and (b) one whose cover letter tells
you where their code can be found on GitHub/Bitbucket, thus allowing you to inspect
it, and who has a Stack Overflow profile where they have answered technology
questions, had those answers reviewed by their peers and
accrued reputation and standing, which would you pick? I know which candidate I would be much more
interested in. Clearly, using Git allows you to tap into this ecosystem. There
is no technical reason why this could not happen on something like SVN;
in practice, though, the ecosystem is not there.
The other aspect is social: GitHub has many
social components, and thereby signals which can be used for metrics. This, in
turn, has knock-on effects on developing measures of impact: new metrics
systems such as ImpactStory, for example, will track
not just your papers, citations etc., but also your open source contributions
via GitHub (the number of commits, followers, forks etc.). This becomes one signal
in a more complete picture of the impact of a scientist/coder/engineer than
traditional paper metrics alone provide.
The downside of all of this is that, in a
way, it almost condemns you to participation. But I suspect that this is the
direction that knowledge work will take anyway: everything we do will become
increasingly social. And, of course, it will become a significant career
problem for those who don’t want to participate in these systems or can’t
because of, for example, institutional constraints.