How freely should scientists share their data?
The Open Science movement champions transparency, but how much and how quickly is a matter of dispute.
16 August 2018
At the beginning of graduate school, I decided I wanted to study how epileptic seizures damage the brain. I was in something of a pickle: I wanted to use magnetic resonance imaging (MRI) to study this damage, but I didn’t have access to MRI data of patients with epilepsy. Even if I had that data, I didn’t know much about programming or mathematics or physics, so I couldn’t have created ex nihilo the software tools to analyze the data anyway. So, I was driven and energetic and wanted to study epilepsy, but I didn’t have the data or tools to work with.
But other people did. With the help of my mentor, I struck up collaborations with groups at the University of Texas at Houston and New York University, who shared high-quality MRI data of patients with epilepsy, free of charge. I corresponded with researchers at Oxford and Harvard to learn how to use their MRI analysis programs, which they also shared free of charge. This model for sharing data and software tools made a deep impression on me. Everyone benefited; I was able to study epilepsy, my collaborators were able to reuse data that was otherwise gathering dust, and we were trying to improve the way we treat patients.
At around this time, I first heard about the Open Science movement — the increasingly popular belief that scientific methods and data should be freely available. The overall goal is to make science as democratic and accessible as possible. To do this, Open Scientists make their data, methods and code (computer programs that analyze data) openly available to the public. Open Scientists also share with their colleagues, which, as I discovered as a graduate student, can be a great boon to science.
I also heard cautionary tales that the Open Science movement had a dark side, that “openness” had, at times, devolved into bullying and theft. Some compared the Open Science movement to Communism: good in principle, impossible in practice. In informal settings — at dinner, over drinks — I was reminded that science was a competitive business.
But I didn’t worry that much until early this July.
Public humiliation in the name of open science
Jack Gallant is a cognitive neuroscientist at the University of California, Berkeley, who in 2016 showed us what listening to the Moth podcast does to our brains. A few years before this, he showed us he could — based only on measures of brain activity — actually reconstruct images of movies people were watching. If he had lived three hundred years ago, he might have been declared a wizard.
Gallant’s projects are scientific home runs. They’re sexy — so much so that his analysis of the Moth podcast was published in Nature, where it was touted with a professionally produced infomercial-style video. Freakonomics and NPR interviewed him.
Behind Gallant’s success was a mountain of money and data. Scientists propose research ideas and compete for grant money. Money allows scientists to collect data, which then allows them to test their ideas. Publish, repeat. Over the past two decades, Gallant’s winning ideas have made him a prominent neuroscientist who runs a successful lab, something akin to a high-ranking official within the scientific enterprise.
Because he and his work are so well known, it was particularly jarring to me when Gallant’s colleagues publicly humiliated him on Twitter.
On July 4, Gallant (@gallantlab) was promoting Open Science, tweeting forth about free access software platforms. Gallant argued that giving away free code is pointless if it only works within an expensive software program, which, he continued, is “NOT open code, it is a walled garden.”
“Nice advice. But what about data?” Manlio De Domenico (@manlius84), a theoretical physicist, tweeted the next day, “We keep trying to ask access to data used in your nature 2016, but we received not a single reply, yet. #opencode #opendata”
“Hi Manlio sorry for the lack of a reply,” Gallant replied. “The original authors are still writing further primary research papers on these data so they haven't been released yet but we expect to be able to do that very soon.”
“‘We still want exclusivity to publish more papers’ isn’t a great excuse. Did you note data restrictions in the manuscript?” tweeted Andre Brown (@aexbrown), referring to Nature’s policy that, on publication, authors should make their data, code and protocols “promptly” and publicly available. (Note the words on publication.)
It appeared that Gallant had transgressed fundamental principles of Open Science — perhaps even Nature’s policy. Was Gallant an honorable scientist or a devious hypocrite?
In subsequent tweets, De Domenico lamented that Gallant’s paper had given him a series of ideas that he wanted to test but couldn’t because he needed Gallant’s data. “This is not advancing human knowledge,” De Domenico asserted.
Gallant dug in: “And why do you assume that your project is better than the ones that we are continuing with these data? My students and postdocs are an awesome group of people, the stuff they have in the pipeline is great! But I can’t afford for them to be scooped.”
Later, Gallant re-affirmed his commitment to Open Science. He pointed out that he had shared many datasets in the past and detailed his reason for not (yet) sharing this particular dataset: complex data takes time to understand and his lab wanted to better understand the data before releasing it to the world. In essence, Gallant argued that since his lab competed for and won money to collect the data and then worked to collect it, they should have first dibs to work on it.
The (academic) Twitterverse was up in arms. Gallant’s “we’re working” defense was called a “nonsense excuse,” “scandalous” and delusional. The debate spilled through multiple threads for nearly two weeks. Gallant’s public shaming boomeranged to Nature’s Web site, whereon someone named Richard Senate (perhaps a pseudonym?) trumpeted, “Jack Gallant refuses to share the data (in violation with Nature’s Journal Policy and with his NSF grants).” Later, David Eccles, a bioinformatics researcher in New Zealand, posted a mash-up of Gallant’s tweets onto Nature’s Web site. Some called for Nature to boycott Gallant and to retract his paper.
Throughout this back-and-forth, I was glued to my Twitter feed. This was the first time that I’d seen well-established academics, emboldened by the digital courage of a keyboard or smartphone, publicly shame their colleagues. This was the first time that I’d seen the principles of Open Science marshalled to harm someone’s career.
This made me question the ideals of Open Science. Consider a highly productive lab that writes a grant to fund a series of studies and the development of new tools. They spend years collecting data and building the tools for these proposed studies. Then, they finish a portion of the project and begin to publish results. Should they be required to release their data to the community? If so, when? Who owns that data? And what business do journals have in enforcing data sharing?
The opening of clinical trials
One of the first examples of Open Science began in the 1990s with practical problems facing clinical trials: clinical trials are expensive, take a long time and represent thousands of hours of work from researchers and voluntary participants, often patients. Without a central way to document ongoing and completed trials, two groups or companies might unknowingly be testing the same drug.
A trial might end and, because it produced a null result, never be published; then another group might wander down the same pharmaceutical dead end. In addition, there was (and is) a widespread concern that data obtained and reported from clinical trials needed greater transparency, accountability and evenhandedness.
In 1997, the U.S. Food and Drug Administration (FDA) began requiring clinical trials to register at ClinicalTrials.gov. This allowed researchers planning a new trial to check the registry and make sure no one was already working on the same thing.
At the same time, the European Medicines Agency (EMA), tasked with approving drugs within the European Union, began to increase public access to clinical trial data. Although data about individual participants in a trial was initially considered confidential commercial information (and so not publicly available), the EMA subsequently changed its position, citing (among other reasons) the public interest. The more access the public has to this clinical trial information, the agency reasoned, the more the data could be understood and thereby used to improve patient care.
The U.S. government didn’t follow suit. Instead of participant-level data, since 2007 the U.S. has required only that “summary results information [be] submitted and posted in a timely manner,” and only if the trial received funding from the NIH. The U.S. treats participant-level data as “proprietary data” owned by whatever institution worked to collect it. It is not owned by the researcher or researcher’s lab; it is not owned by the scientific community or by the journals that eventually publish the results.
In the U.S., data is legally protected intellectual property and can lead to patents. Patents exist, in theory, to protect and encourage the financial investment required to commercialize scientific ideas. Because publicly releasing intellectual property can jeopardize patentability, publicly releasing data could ruin opportunities to translate a nifty scientific idea into a tangible product that can change lives; in other words, it could ruin one of the primary goals of the scientific enterprise.
Of course, not all scientific studies will produce patentable intellectual property (consider general relativity), but sometimes they do (consider MRI); so, these discussions are complex.
Even legislators have weighed in on this complexity: “I appreciate that there are many policy, privacy, and practical issues that need to be addressed in order to make data sharing practical and useful for the research community,” U.S. Senator Elizabeth Warren wrote in 2016 in The New England Journal of Medicine, “but the stakes are too high to step back in the face of that challenge.” Warren’s editorial went on to congratulate a recent decision by journal editors to circumvent these “practical issues.”
Earlier that year, journal editors had banded together and decided that, if scientists, funding agencies and even Congress (!) couldn’t agree to require scientists to publicly release their data, they could make it a requirement for publication.
Journal editors as arbiters
In February 2016, the 14-member International Committee of Medical Journal Editors (ICMJE) published an editorial in The New England Journal of Medicine. They announced that to be considered for publication in a member journal, authors would have to release their de-identified participant-level data presented in their research “no later than 6 months after publication.”
Later that year, in August 2016, a separate international consortium (representing 282 investigators in 33 countries) published a dissenting response in The New England Journal of Medicine that argued 6 months was too short.
“We believe 6 months is insufficient for performing the extensive analyses needed to adequately comprehend the data and publish even a few articles,” the group wrote. In any large grant application, scientists outline multiple hypotheses they wish to investigate through multiple analyses. Describing these analyses often requires a series of articles and, of course, time.
If required to give up exclusive access to data after their first publication, investigators “will effectively be competing with people who have not contributed to the substantial efforts and often years of work required to conduct the trial.” This dissenting group — while very much still in support of Open Science — argued that investigators should be allowed a minimum of 2 to 5 years to make their clinical trial data public.
Journals as executors
In 2015, Science ran a “Scientific Standards” editorial, prepared by the Center for Open Science’s Transparency and Openness Promotion (TOP) Committee. To motivate new Open Science standards, the editorial began by citing a 2007 survey of 3,247 NIH-funded scientists that reported widespread “normative dissonance,” meaning people’s ideals and behaviors were misaligned.
Of the (many) possible causes for this dissonance, the TOP committee named three: “transparency, openness, and reproducibility are readily recognized as vital features of science… [and yet we have an] academic reward system that does not sufficiently incentivize open practices.” The TOP committee (recall, O = Openness) assumed that researchers wanted more openness but lamented that “there is no centralized means of aligning individual and communal incentives via universal scientific policies and procedures.”
They created a scheme wherein journals would be graded on their commitment to Open Science. Ranging from 0 (no Open Science policy) to 3 (release of data and materials is prerequisite to publication), the idea was that scientists would want to publish in journals with higher grades. Like a restaurant’s sanitary inspection grade, but for journals.
The TOP recommendations increased the scope of mandatory data sharing and changed the regulatory authority in charge. Whereas government regulations were limited to clinical trials, the TOP scheme recommended that all scientific data be released at publication. They shifted the power to execute the data management plan (which describes what data will be shared and when) from the contractual arrangement between funding agencies and institutions to the publication policy of journals. In this scenario, a journal would not allow scientists to publish unless they released their data. I think this shift in scope and power is significant and is where the Open Science movement breaks down.
Who benefits from open data?
I’m a clinician and, for me, clinical trials have a palpable immediacy: clinical trials sculpt my patient care, so I want to make sure the latest results are transparent and reproducible. If I base my clinical decisions on a biased trial with crappy analyses, at worst, I could kill people, and, at best, I could fail to help people. The immediacy of open clinical trial data is compelling; I benefit, and my patients benefit.
But not all data have this immediacy. Gallant’s data showing how the Moth podcast affects the brain is pretty far removed from life-and-death clinical decisions. Much of what I do as a researcher is pursue knowledge, far from life-and-death stakes.
In 2016, the Organization for Human Brain Mapping (OHBM) released a report stating: “Data sharing is one of the cornerstones of verifiable and efficient research, permitting others to reproduce the results of a study and maximizing the value of research funds already spent.” The report goes on: “A comprehensive data management plan — that involves all authors, collaborators, funding agencies, and publishing entities — is essential no matter what is shared and should be considered from the outset of a study. Without such planning, in a jumble of folders and after a graduate student or postdoc has moved on, data can effectively be lost.”
(NB: the data management plan should be set in the original contract at the outset of a study, not to satisfy a journal or Twitter mob.)
This is a straightforward economic argument: it’s more cost-effective to pool our resources. In my 2015 article on brain imaging databases, I cited a conservative estimate that between 1990 and 2011, over 22,000 functional MRI studies were performed, representing an estimated 144,000 hours (approximately 12,000 datasets with about 12 subjects each and about one hour per subject) of scan time. At Yale, an MRI scan costs about $600 an hour, so this represents an investment of around $86.4 million for the data alone. So, data sharing makes good sense for the scientific enterprise; we all benefit.
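For readers who want to check that arithmetic, here is a minimal sketch in Python. It uses only the rough figures quoted above; the variable and function names are my own illustrative choices, and the result is a back-of-envelope estimate, not an audited cost.

```python
# Back-of-envelope estimate of what the scan time behind these studies cost.
# All inputs are the approximate figures quoted in the text; the names are
# illustrative assumptions, not taken from any published analysis.

DATASETS = 12_000            # approximate number of functional MRI datasets
SUBJECTS_PER_DATASET = 12    # about 12 subjects per dataset
HOURS_PER_SUBJECT = 1        # about one hour of scan time per subject
COST_PER_HOUR_USD = 600      # approximate hourly MRI rate quoted for Yale


def scan_hours(datasets, subjects_per_dataset, hours_per_subject):
    """Total scanner hours across all datasets."""
    return datasets * subjects_per_dataset * hours_per_subject


def scan_cost_usd(hours, cost_per_hour):
    """Total cost of the scanner time alone, in U.S. dollars."""
    return hours * cost_per_hour


total_hours = scan_hours(DATASETS, SUBJECTS_PER_DATASET, HOURS_PER_SUBJECT)
total_cost = scan_cost_usd(total_hours, COST_PER_HOUR_USD)

print(f"{total_hours:,} scanner hours")          # 144,000 scanner hours
print(f"${total_cost / 1e6:.1f} million spent")  # $86.4 million spent
```

Changing any one input (say, a cheaper scanner rate) shows how sensitive the total is to these rough assumptions.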
But I remain unconvinced that there is an immediacy to sharing most forms of scientific data — especially an immediacy in the name of the public interest. I am convinced that other scientists feel an immediate need to analyze data sets that they do not own — especially if the results of a particularly excellent data set can be published in Nature and make them famous.
At some point in the Twitter debate, a principled plea for Open Science devolved into a principled justification for public shaming and, in a very real way, for demanding Gallant release his data before he was ready. Gallant himself wasn’t done testing and extending his own ideas: his 2016 Nature paper wasn’t the grand finale but rather a progress update. So, the urgency appears to stem not from a desire to advance human knowledge (prudence would permit Gallant to complete his contracted work) but rather from scientific competition (wherein someone stands to improve upon Gallant’s initial Nature report before Gallant himself is able to).
The absence of a clear, precise Open Science policy — from Gallant’s funding body, from Gallant himself, from Nature — allowed sufficient ambiguity to justify anger. And a public shaming.
What concerned me most in Gallant’s Twitter pile-on were demands that Nature sanction Gallant for not sharing his data, as if Nature were the executor of Open Science. In addition, although Nature did not fund Gallant’s research or have a say in Gallant’s original data management plan (the contract between Gallant’s institution and his funder), Twitterians wanted Nature to create and enforce its own data management plan: an Open Science lone ranger rewriting and enforcing federal law.
This is an untenable and ironic position for journals. Journals are businesses that profit at every step in the publication process — scientists pay to submit and publish their work, everyone else pays for access. On one hand, requiring scientists to release their data could be good business for journals — excellent data sets beget excellent papers. Nature Communications published a report to this effect this July. On the other hand, a dogmatic adherence to unpopular business strategies could lead to a journal’s demise because other competing journals will meet marketplace demands. Perhaps this is why most journals do not enforce data sharing policies.
Journals’ commitment to Open Science is necessarily tenuous. Journals make papers freely accessible, if the scientist pays a fee of up to $5,000. Otherwise the work remains behind a paywall. Nature’s final version of Gallant’s 2016 article, for example, lives within Nature’s “[pay]walled garden.” For $32, you can freely access it.
Further reading on data sharing
The Project Open Data Web site describes the core principles of open data and how these apply to (non-secret) data collected by the U.S. government.
The NEJM Web site curates an excellent series of editorials on data sharing that explore “diverse viewpoints from across the medical community.” Some of these are Open Access. My favorite editorial (much of my historical overview comes from this piece):
Preparing for Responsible Sharing of Clinical Trial Data, N Engl J Med 2013; 369:1651-1658 DOI: 10.1056/NEJMhle1309073.
This piece was originally posted on Scientificamerican.com