Technology feature

18 June 2019

A “petting zoo for code” makes studies easier to reproduce

A new tool helps users to compose, compute and publish reproducible articles.

Jeffrey M. Perkel

Credit: Fanatic Studio / Alamy Stock Photo



When biophysicist Will Ludington came to write up a recent paper about what he calls “the algebra of the microbiome”, he and his coauthors specifically took computational complexity into account.

Ludington’s study involves the interplay between the microbiome and its host: whether the host (in this case, fruit flies) responds to each microbial species specifically, or to the interactions between them. Short answer: sometimes both are true.

Working that out involved some 25 pages of mathematical exposition, plus computational code and experimental data to make it real.

Ludington posted the raw data on the online repository Dryad, but to make the information more accessible to readers, he also published the figures -- with the underlying code and data -- on the new web-based computing platform Nextjournal.

Joseph Kliegman, chief scientist at Nextjournal, describes the tool as “a cloud-hosted notebook for reproducible research that allows users to easily compose, compute and publish perfectly reproducible articles.”

The Berlin-based company, whose product opened for public beta testing on 6 May, offers one of a small number of services -- others include Binder and Code Ocean -- that allow researchers to share not only their code and data, but also the computing environments used to execute them.

In so doing, these tools make it easier for researchers to study, reuse and adapt their colleagues’ computational tools, potentially boosting the impact of their research, while also freeing them from many of the headaches that would otherwise arise.

Computational petting zoo

Research software often comprises custom code and scripts, which rely on a fragile web of software ‘dependencies’ -- code libraries that encapsulate the instructions for, say, running statistical tests, or plotting data. If a user’s web is just slightly different, the software could fail.

As a result, users who wish to actually execute the code must spend considerable time downloading, installing and configuring the missing components. Many simply give up.
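To make that fragility concrete, here is a minimal Python sketch (not taken from any of the tools discussed here) that compares the library versions a script was written against with what is actually installed on the user's machine. The package names and version numbers in PINNED are hypothetical, chosen only for illustration, and the function names are our own.

```python
from importlib import metadata


def find_mismatches(pinned, installed):
    """Return {name: (wanted, found)} for every pinned package whose installed
    version differs from the pin; found is None if the package is absent."""
    problems = {}
    for name, wanted in pinned.items():
        found = installed.get(name)
        if found != wanted:
            problems[name] = (wanted, found)
    return problems


def installed_versions(names):
    """Look up the versions actually installed in this Python environment."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            pass  # not installed; omitted so it is reported as missing
    return versions


# Hypothetical pins, for illustration only.
PINNED = {"numpy": "1.16.4", "pandas": "0.24.2"}

if __name__ == "__main__":
    report = find_mismatches(PINNED, installed_versions(PINNED))
    for name, (wanted, found) in report.items():
        print(f"{name}: need {wanted}, found {found or 'nothing'}")
```

An empty report means the local web of dependencies matches the pins. Anything else is a component that the user would have to install or downgrade by hand before the analysis could be expected to run as published.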

In one recent study posted on the bioRxiv preprint server, an international team of researchers found that 57.1% of 98 computational biology tools could not be installed simply by following the published instructions, and that two-thirds of those tools still could not be installed even after manual intervention.

For Ludington, who runs a lab at the Carnegie Institution for Science in Baltimore, Maryland, this is a major roadblock in ensuring research transparency. “I shouldn't have to fight to get somebody else's analysis to run,” he explains.

By using services like Nextjournal, he says, users are spared the fight.

At their heart, what Nextjournal, Code Ocean and Binder provide is a way to execute code without software installation. They do this by collecting and encapsulating all the required components in a self-contained computing environment called a ‘Docker container’, and providing an interface to work with and execute that code in the cloud.

The result, says Ben Marwick, an archaeologist and reproducibility advocate at the University of Washington in Seattle, who has used Code Ocean to publish his computing environments, is like a computational “petting zoo” -- a platform for making code and data available to the scientific community for interactive exploration, validation and modification.

Community access

Researchers have used these and other platforms to make published code and data available to the scientific community.

Several journals have integrated Code Ocean compute capsules in their published articles, for instance, and some Nature Research journals use the platform for peer review as well.

And in February, the open-access journal eLife published its first proof-of-principle computationally reproducible article, in which the code (and the underlying computing infrastructure) were embedded directly into the article itself.

It can be difficult to get these computing environments up and running. While the code worked on our own computers, we required assistance from both Code Ocean and Nextjournal to get our documents working on those platforms. And with Binder, we were only able to get one of two notebooks working.

In a sense, then, our experience recapitulates what researchers deal with on a daily basis. But by encapsulating their code, data and computing environment, says Code Ocean CEO Simon Adar, researchers can ensure that their code will work now and in the future, both for themselves and their colleagues.

And that can pay unexpected dividends.

When the authors of the recent bioRxiv study counted how highly cited the papers describing each tool were, sorting the results by how easy those tools were to install, they found a correlation.

“Tools which we were able to install had significantly more citations compared to tools which we were not able to successfully install within two hours,” the authors wrote.

Making computing environments easily accessible lets authors extend the impact of their work to the research community in unprecedented ways, Marwick adds. “It reduces the barrier for people to interact with published research.”

Binder: manage digital collections

One easy option for creating such environments is Binder. Binder is a free software tool that allows researchers to execute, in their web browsers, computational notebooks -- web pages that combine programming code, explanatory text and images -- that have been deposited on a public Git version control repository, such as GitHub.

Users simply have to access mybinder.org, provide the location of the software repository, and Binder will build and launch a Docker container to run it.
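The same launch can be shared as a link. The sketch below builds a mybinder.org launch URL for a GitHub repository, using the /v2/gh/ address pattern that the mybinder.org form generates. The function name is our own, the default ref of "HEAD" is an assumption (a branch, tag or commit can be supplied instead), and the notebook file name in the usage line is hypothetical.

```python
from urllib.parse import quote


def binder_launch_url(owner, repo, ref="HEAD", notebook=None):
    """Build a mybinder.org launch link for a public GitHub repository.

    owner/repo identify the repository, ref is a branch, tag or commit,
    and notebook (optional) is a file to open once the container starts.
    """
    # Percent-encode each piece; safe="" also encodes "/" inside branch names.
    url = "https://mybinder.org/v2/gh/{}/{}/{}".format(
        quote(owner, safe=""), quote(repo, safe=""), quote(ref, safe="")
    )
    if notebook:
        url += "?filepath=" + quote(notebook, safe="")
    return url


if __name__ == "__main__":
    # Repository from the update note at the end of this article;
    # "example.ipynb" is a placeholder file name, not a real notebook.
    print(binder_launch_url("jperkel", "Mapping-tools-examples"))
    print(binder_launch_url("jperkel", "Mapping-tools-examples", notebook="example.ipynb"))
```

Opening such a link asks Binder to build (or reuse) the container for that repository and then drop the reader into a running session, so nothing needs to be installed locally.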

Marwick says he uses Binder to spin up tools that he finds on GitHub. “Then I can interact with it in a way that feels natural to me.” (See our example here.)

Other tools enable users to author new code in the cloud.

Nextjournal: be the author

Nextjournal, for instance, allows users to write computational notebooks. Researchers can create or import notebooks in a number of different programming and scripting languages, including Python, R and Clojure, the language in which Nextjournal itself is built.

They can create computing environments that mix and match those languages in a single document, though data objects cannot yet be passed back and forth from language to language. (According to Kliegman, this feature is in development.)

Data are embedded within the notebook itself. (See our example here).

Readers can download those files to their computer if they want to analyse the data using their own code, or ‘remix’ published notebooks in order to alter and experiment with the code and data themselves. They can also save the computing environment for sharing and reuse, and group-edit the document as if it were a Google Doc. All changes are logged via built-in version control.

Kliegman says the company sees Nextjournal “as a tool to bring authors and publishers closer together” -- a platform with which researchers can bundle code, data, figures and text in a single document that they can submit for publication. No journals are yet set up to use such a document, but the company has “a number of conversations” ongoing with scientific publishers, Kliegman says.

Nextjournal is free for researchers who are willing to make their notebooks public, providing up to 4 central processing units (CPUs), 1 graphics processing unit (GPU), 15 GB of memory and 500 GB of storage.

Paid accounts, providing up to 96 CPUs, 8 GPUs, 624 GB memory and essentially limitless storage, are available for individual users and enterprise subscriptions. “We’re actively trying to figure out ways to be more inclusive with our pricing schedule,” says Kliegman.

Other platforms for sharing computational notebooks also exist, including Google Colaboratory and Microsoft Azure Notebooks; see here for a review.

Code Ocean: the cloud-computing platform

Code Ocean is more of a general-purpose cloud-computing platform. Working in the web browser, users can create, edit and upload code and data, and log their changes using the Git version control system. (See our example here)

By default, Code Ocean presents users with a coding interface that is decoupled from the underlying computing architecture; to execute their code, users must click a button that creates and launches a Docker container in which the code runs.

Alternatively, Code Ocean users can spin up a cloud-based computer, called a ‘Cloud Workstation’, and develop their software there as if on their own machine. Either way, the result is a Docker-based ‘compute capsule’ of code, data and computing resources, which can be published, exported and assigned a DOI. (Nextjournal can also mint DOIs.)

Code Ocean offers a no-cost ‘researcher’ plan providing 5 GB of storage and 1 hour of computing time per month, or 20 GB of storage and 10 hours/month for academic users. Team and institutional accounts, providing additional computing and storage resources as well as administrative and collaboration features, are also available.

Updated 19 June 2019: Thanks to Anthony Gitter (@anthonygitter) of the University of Wisconsin-Madison, both notebooks now run in Binder. See here: https://github.com/jperkel/Mapping-tools-examples
