When it comes to reproducible science, Git is code for success
And the key to its popularity is the online repository and social network, GitHub.
11 June 2018
Early in his graduate career, John Blischak found himself creating figures for his advisor's grant application.
Blischak was using the programming language R to generate the figures, and as he iterated and optimized his code, he ran into a familiar problem. Determined not to lose his work, he gave each new version a different filename, underscore-1, underscore-2, and so on, but failed to document how they had evolved.
“I had no idea what had changed between them,” admits Blischak, who is now a postdoctoral scholar at the University of Chicago. “And if the professor were to come back and ask which version I had used to create this figure, I would have had no idea.”
Later, while attending a workshop on basic research computing skills, he discovered a better approach: Git.
Git is a free and open-source distributed version-control system. Written to manage development of the open-source Linux operating system, it allows large teams of programmers to work independently on their own copies of the code, track changes with line-by-line granularity, merge those changes back into the main repository, and roll back changes as necessary.
Using Git, Blischak says, he no longer needed to maintain multiple copies of his files. “I just keep overwriting it and changing it and saving the snapshots. And if the professor comes back and says, ‘you sent me an email back in March with this figure’, I can say, ‘OK, I'll just go back to the March version of my code and I can recreate it’.”
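The snapshot idea Blischak describes can be illustrated with a toy model. The following Python sketch is not how Git is implemented, and the class name, method names, dates, and R snippets are all hypothetical. It assumes snapshots are recorded in chronological order, and shows only the principle: one working file, a history of labelled snapshots, and a way to recover the version that existed on a given date.

```python
import hashlib
from datetime import date


class SnapshotHistory:
    """Toy history for a single file: each commit stores a full snapshot.

    Hypothetical illustration only; real Git stores and indexes
    snapshots very differently.
    """

    def __init__(self):
        # Oldest first; each entry is (short_id, day, message, content).
        self._commits = []

    def commit(self, content, message, day):
        """Record a snapshot and return a short identifier derived from it."""
        payload = f"{day.isoformat()}\n{message}\n{content}".encode("utf-8")
        short_id = hashlib.sha1(payload).hexdigest()[:7]
        self._commits.append((short_id, day, message, content))
        return short_id

    def log(self):
        """List (short_id, ISO date, message) for every snapshot, oldest first."""
        return [(cid, day.isoformat(), msg) for cid, day, msg, _ in self._commits]

    def as_of(self, day):
        """Return the content of the latest snapshot made on or before `day`."""
        for _, commit_day, _, content in reversed(self._commits):
            if commit_day <= day:
                return content
        raise LookupError("no snapshot on or before that date")


# Hypothetical figure script evolving over three snapshots.
history = SnapshotHistory()
history.commit("plot(x, y)", "First draft of figure", date(2018, 3, 5))
history.commit("plot(x, y, col = 'red')", "Change colour for grant figure", date(2018, 3, 20))
history.commit("plot(x, log(y), col = 'red')", "Switch to log scale", date(2018, 5, 2))

# Recover the code as it stood at the end of March.
march_version = history.as_of(date(2018, 3, 31))
print(march_version)  # prints: plot(x, y, col = 'red')
```

The filename never changes; the history, not a suffix such as underscore-1 or underscore-2, records what changed and when.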
Git also facilitates scientific reproducibility across a wide range of disciplines, from ecology to bioinformatics, archaeology to zoology. When used in conjunction with GitHub, an online Git repository hosting service and social network, the tool allows researchers to store and share their code, analysis scripts, and data, and to ensure analyses are always executed using the appropriate versions of the files.
Other researchers can then access those files to see how the work was done and to apply it to their own studies — features that advance research transparency, says Juan Antonio Vizcaíno, proteomics team leader at the European Bioinformatics Institute in Cambridge, United Kingdom. Making code, data, and analysis pipelines available to the research community, he explains, “in principle makes it easy to reproduce, or makes it possible to reproduce the analysis by different people.”
GitHub’s enormous popularity — it has more than 28 million users — made it attractive to Microsoft Corporation, which has announced plans to acquire it for US$7.5 billion. The implications for programmers, and the scientific community, remain unclear. Microsoft says it wants to accelerate enterprise use of GitHub and expose new audiences to Microsoft developer tools and services.
Despite its popularity, Git is also a tool that researchers love to hate, with a vexing and confusing command-line interface intended more for seasoned programmers than software newbies.
Greg Wilson, who co-founded the computational-science workshop series, Software Carpentry, is blunt in his assessment. Though he recognizes its value, and uses it himself, he says “I hate Git. It is one of the worst pieces of software to teach that I've come across in 35 years of teaching software.” But, he adds, mastering Git is as essential to modern research as learning to read English. Those who use Git and have become immune to its complexity, he jokes, suffer from ‘Git-induced Stockholm Syndrome’.
Titus Brown, a bioinformatician at the University of California Davis, calls Git a “power tool”, one whose power comes at the cost of complexity.
Git is not the only tool of its type. Other options include Mercurial, a more user-friendly alternative. So, why is Git so popular? In a word, GitHub.
With some 85 million software repositories, GitHub is an elegant (and largely free) online portal built on Git's foundations, on which programmers and researchers can archive, share, discuss, and edit their code, manuscripts, documentation, and data. The site provides a convenient online home for projects, as well as a backup in case a user's local copy is ever damaged.
Much more than a code warehouse, GitHub is effectively a social network for software development. Programmers can ‘fork’ any user's public project (that is, make their own copy), modify it, and use that updated code — or any previous version thereof — on their own data. They can then make that updated code available to the community, a form of “permissionless” editing that is “tremendously powerful,” says Brown. Some journals even run their peer-review processes on GitHub.
“Communication and collaboration are the killer apps of version control,” wrote University of British Columbia statistician Jenny Bryan in a recent article. And that is as true for bench scientists as it is for programmers. In one recent example, Casey Greene of the University of Pennsylvania Perelman School of Medicine and Anthony Gitter of the University of Wisconsin, Madison, led a team of more than 40 researchers who collaborated to write an extensive review on a form of artificial intelligence called ‘deep learning’. The project, which they called ‘the deep review’, was managed (and eventually automated) entirely on GitHub.
Brown's lab uses GitHub to manage collaborations, control access, run automated quality checks, and provide a ‘canonical’ copy of its code. The lab also uses it to advance reproducibility, Brown says, by allowing team members to identify and retrieve precisely those versions of the code, scripts, and data they originally used to perform a particular analysis.
Tracy Teal, executive director of the Carpentries, the parent organization that oversees Software Carpentry, says that Git and GitHub are even useful for those researchers who like to work solo. “Most researchers are primarily collaborating with themselves,” Teal explains. “So, we teach it from the perspective of being helpful to a ‘future you’.”
But there are alternatives to GitHub, including GitLab and Atlassian’s BitBucket. Both have reported sharp spikes in new users in the wake of Microsoft’s announcement.
Learning the basics
Whatever platform you use, Git can be a daunting prospect for the uninitiated, but the basics are straightforward enough, and several good tutorials are available online. The program is text-based, but several free graphical user interfaces are available, including SourceTree and GitKraken. And many programming tools, such as RStudio, feature Git integration as well.
But there’s no arguing the tool is complicated, and things often can go wrong — a fact this author has experienced first-hand. In that case, Blischak urges perseverance. “Appreciate the fact that it’s going to be a little complicated to start off. And don’t get discouraged if it seems a little overwhelming at first, because that's how everybody feels.”
Indeed, getting a feel for the software in a non-mission-critical situation may be the best way to learn Git, says Teal; that way, instructors can walk you through the hiccups that inevitably arise. "I won't pretend that it's not a challenge. It's just one that's worth it," she says.
At the Carpentries, instructors don’t teach reproducible research per se; rather, they see reproducibility as a consequence of good scientific practice. “We see reproducibility as a ‘lagging indicator’ rather than a ‘leading indicator’,” Teal explains. “Reproducibility comes from good practices in doing research coding.”
Jeffrey Perkel is Technology Editor, Nature.