Version Control for the Lone Scientist

I am an unabashed fan of distributed version control systems (DVCS) such as Mercurial or Git. And from time to time, I get drawn into discussions with friends and colleagues about the pros and cons of these.

One objection in particular comes up time and time again: that DVCS are complete overkill for the lone coder and, by extension, for the lone scientist. Here are some thoughts on version control for computational scientists, working alone or collaboratively.

In summary, I think one can get at this from several angles: (a) management of change and a state of mind, (b) the reproducibility of the computational experiment, (c) showcasing yourself as a researcher/hacker and novel measures of impact. These all overlap to a certain extent.

(a) Management of Change and a State of Mind

This is the obvious one: version control systems manage change. But they do it in very different ways. Version control systems such as CVS or Subversion are in essence feudalistic models of working: a central server holds a canonical version of an artifact (source code, an ontology, a piece of writing), which is distributed to clients. Because of the feudalism, “commit” equals “inflict”: someone commits a change to an artifact and it gets inflicted on all the clients working with the same repository. So what are/were the consequences of this? People held changes back until they were sure nothing would break, so there were few small, self-contained (“atomic”) commits (I realize that the discussion as to whether atomic commits are a good thing or whether even broken code should get checked in is one for an evening in the pub), and code hardly ever got checked in.

Contrast this with distributed version control systems. Here, publishing a change is a two-step process. Code exists in the repository on your own machine, and you develop against that. Code may also exist on another machine/server/host such as Bitbucket or Github, which may or may not hold a canonical version. In any case, a commit here is not an infliction: it only records the change in your local repository, and it takes a separate “push” operation to add what you have just committed to the remote one.
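To make the distinction concrete, here is a minimal Python sketch of the split between commit and push. It is a toy model and not how Git or Mercurial are actually implemented: the class ToyRepository and its methods are names made up for this illustration, and history is assumed to be a simple linear list of commit messages.

```python
class ToyRepository:
    """Hypothetical stand-in for a repository: a linear list of commit messages."""

    def __init__(self, history=None):
        self.history = list(history or [])
        self.origin = None  # the repository this one was cloned from, if any

    def clone(self):
        """Return an independent local copy that remembers where it came from."""
        local = ToyRepository(self.history)
        local.origin = self
        return local

    def commit(self, message):
        """Record a change locally; the origin is not touched."""
        self.history.append(message)

    def push(self):
        """Publish local commits the origin has not seen yet; return how many."""
        # Assumption: the origin's history is a prefix of ours (nobody else pushed).
        new_commits = self.history[len(self.origin.history):]
        self.origin.history.extend(new_commits)
        return len(new_commits)


shared = ToyRepository(["initial import"])
mine = shared.clone()

mine.commit("try a new normalisation step")
mine.commit("fix off-by-one in the binning")
seen_before_push = list(shared.history)  # still only the initial import

pushed = mine.push()                     # only now do others see the work
seen_after_push = list(shared.history)
```

In this sketch the two commits exist only in the local copy until push() is called, which is the sense in which “commit” no longer equals “inflict”.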

The typical result is more commits, at least locally, and better preservation of work. And this makes sense even for the lone developer.

Another aspect in this discussion concerns the way in which changes are tracked by these systems: Subversion and others of a similar ilk track versions, that is, successive states of the file tree, whereas Git and Mercurial record each change together with its ancestry, so the tool knows what has actually been changed and where two lines of work diverged. Again, an almost trivial statement to make, but it has huge implications. Merging becomes much, much easier that way, resulting in more branching, more commits, more experiments. That’s a good thing, particularly for a scientist. Much work in computational and data science involves parameter sweeps: running the same protocol again and again, but with altered parameters each time. Developing workflows and computational procedures quite often requires experimentation: starting from a baseline script, branching, making changes, merging these back, branching again, experimenting and so on. The commit and branching mechanisms in version control systems can be used to track and document these experiments: it is a step towards reproducible computational science.
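One way of tying a parameter sweep to the version control history is to store the current commit id next to every result. The following Python sketch is only one possible approach and rests on several assumptions: the script is run from inside a Git working copy (otherwise the commit is recorded as "unknown"), run_protocol is a hypothetical placeholder for the real computation, and the parameter names temperature and steps are invented for the example.

```python
import itertools
import json
import subprocess


def current_commit():
    """Return the checked-out commit id, or "unknown" outside a Git repository."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # Git is not installed, or we are not inside a working copy.
        return "unknown"
    return result.stdout.strip()


def run_protocol(temperature, steps):
    """Hypothetical placeholder for the real computation."""
    return temperature * steps


def sweep(grid):
    """Run the protocol once per parameter combination and log the provenance."""
    commit = current_commit()
    names = sorted(grid)
    records = []
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        records.append({
            "commit": commit,   # which version of the code produced this result
            "params": params,
            "result": run_protocol(**params),
        })
    return records


records = sweep({"temperature": [280, 300], "steps": [10, 100]})
print(json.dumps(records, indent=2))
```

Each record then says which parameters were used and which committed version of the code produced the result, so a run can be repeated later by checking out that commit. This only holds if all changes were committed before the sweep was started.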

It is also a state of mind: the staging involved in push and pull mechanisms, in addition to a commit, enables distributed and therefore massively collaborative working. And sooner or later even the loneliest of lonely scientists will have to engage in this way of working, if he or she wants the world to acknowledge and take up the work that has been done. The way of working is so powerful that software development tools such as Git and Mercurial are now also used to author legislation (http://www.quora.com/Ari-Hershowitz/Posts/Hackathon-Anyone-Recode-Californias-Laws), to distribute information (German law, for example, is available on Github) and even to figure out http://www.wired.com/wiredenterprise/2013/01/this-old-house/. Bottom-up, massively collaborative ways of working are the ways of working of the future, and distributed version control systems are one embodiment of this mindset.

(b) The Reproducibility of the Computational Experiment

This picks up the discussion begun in the previous point. When taking version control systems and combining them with conventions around organizing the other components of, for example, bioinformatics projects, we might be able to tackle issues of reproducibility of computational experiments/investigations a bit better. There has been some discussion around this on Biostars and also in the literature, most notably in a paper by Noble about organizing computational biology projects and in our Lensfield paper from a while back. Version control here fulfills three functions: (a) backup, (b) the keeping of a historical record of work done and (c) enabling concurrent work by multiple collaborators, which may sooner or later happen to even the loneliest of scientists.

(c) Showcasing yourself as a hacker/developer/bioinformatician/scientist/whatever

Apart from the possibility of working massively collaboratively, a whole social ecosystem has sprung up around these tools. There’s the obvious: Github is integrated with professional social networks such as LinkedIn and serious job websites such as Stack Overflow Careers. These integrations give hackers and scientists the opportunity to showcase themselves in completely new ways. Ask yourself: if you were an employer looking for a new bioinformatician/scientist/hacker or developer, and you had to choose between (a) an applicant who sends you the standard cover letter/CV combo and (b) an applicant whose cover letter tells you where his/her code can be found on Github/Bitbucket, thus allowing you to inspect it, and who has a Stack Overflow profile showing answers to technology questions that have been reviewed by peers and have accrued reputation and standing, which would you pick? I know which candidate I would be much more interested in. Clearly, using Git allows you to tap into this ecosystem. There is no technical reason why this could not happen on something like SVN; practically, though, the ecosystem is not there.

The other aspect is social: Github has many social components – and thereby signals which can be used for metrics. This, in turn, has knock-on effects on developing measures of impact: new metrics systems such as ImpactStory, for example, will track not just your papers, citations etc, but also your open source contributions via Github, the number of commits, followers, forks etc – it becomes one signal in a more complete picture of the impact of a scientist/coder/engineer than just traditional paper metrics.

The downside of all of this is that, in a way, it almost condemns you to participation. But I suspect that this is the direction that knowledge work will take anyway – everything we do will become increasingly social. And, of course, it will become a significant career problem for those who don’t want to participate in these systems or can’t because of, for example, institutional constraints.