Q&A Adam Russell: The search for automated tools to rate research reproducibility

A US project is exploring the use of software to assign confidence levels to published research.

Dalmeet Singh Chawla

8 August 2018

Recently the United States Defense Advanced Research Projects Agency (DARPA) solicited research proposals on developing automated tools that assign confidence levels to published research, as part of its ‘Systematizing Confidence in Open Research and Evidence’ (SCORE) initiative. Anthropologist Adam Russell, currently a programme manager at the agency, spoke to Dalmeet Singh Chawla about the project.

Why is DARPA interested in new tools that can assess the degree of reproducibility and replicability of studies in the social and behavioral sciences?

The ability to reproduce and replicate results and claims is a hallmark of scientific progress; without it we struggle to distinguish chance and bias from a genuinely true finding. But science is difficult, and often expensive, so evaluating claims can be challenging, especially for non-experts who nonetheless often rely on those claims to build models or make decisions.

I think we have new opportunities to detect and aggregate a lot of ‘weak signals’ about scientific claims. These give us potential new capabilities for assigning a kind of credit score to the degree of reproducibility and replicability of claims. Some evidence suggests that, like a credit score, these weak signals will be highly varied and, in some cases, possibly non-intuitive and unconventional. So we might be looking at combinations of endogenous and exogenous signals.

Endogenous signals are inherent to a study or article, such as evidence of questionable research practices, sample sizes, pre-registration of the research, open data, shared code, and so on. Exogenous signals exist in the wider research ecology, such as author and institution reputation, impact factors, peer review networks, social media commentary, post-publication review, ‘tighter or looser’ scientific communities and cliques, sources of funding, potential conflicts of interest, rates of retractions in that discipline and more.
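To make the credit-score analogy concrete, here is a minimal sketch of how many weak signals might be combined into one confidence score while still reporting what each signal contributed. It is not SCORE's method: the signal names, the weights, and the logistic combination are all assumptions chosen for illustration.

```python
import math

# Hypothetical weights, invented for illustration only.
# Positive values raise confidence, negative values lower it.
WEIGHTS = {
    "preregistered": 0.8,               # endogenous
    "open_data": 0.5,                   # endogenous
    "shared_code": 0.4,                 # endogenous
    "conflict_of_interest": -0.7,       # exogenous
    "high_field_retraction_rate": -0.3, # exogenous
}


def confidence_score(signals, weights=WEIGHTS, bias=0.0):
    """Combine weak signals into a score in (0, 1).

    `signals` maps a signal name to True/False or a number.
    Returns the score and each signal's contribution, so a user
    can see why the score is high or low. Unknown names are ignored.
    """
    contributions = {
        name: weights[name] * float(value)
        for name, value in signals.items()
        if name in weights
    }
    logit = bias + sum(contributions.values())
    score = 1.0 / (1.0 + math.exp(-logit))  # logistic squashing
    return score, contributions


if __name__ == "__main__":
    score, why = confidence_score(
        {"preregistered": True, "open_data": True, "conflict_of_interest": True}
    )
    print(round(score, 3), why)
```

In a real system the weights would be learned from replication outcomes rather than set by hand, and the per-signal breakdown is what would keep such a tool from being a black box.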

How are you defining reproducibility and replicability?

For SCORE, reproducibility is being defined as the degree to which other researchers can successfully compute an original result when given access to a study’s underlying data, analyses, and code. Replicability is being defined as the degree to which other researchers can successfully repeat an experiment.

This is important because we must have appropriate levels of confidence in a claim to avoid either dismissing or over-weighting that claim. DARPA doesn’t anticipate being the be-all and end-all in this area, since there is a lot of work to be done. However, DARPA often does have the ability to bring different communities together to make accelerated progress against problems by helping to standardize methods, definitions, and metrics. DARPA has a history of helping to create new definitions or community standards that enable leap-ahead technologies, such as the DARPA Grand Challenge for self-driving vehicles or the DARPA Cyber Grand Challenge. SCORE may be in an area where that’s now feasible and useful.

There are a few existing efforts that measure reproducibility of scientific papers. What are you seeking to improve on?

Whilst efforts like the r-factor (which measures the veracity of a journal article from the number of other studies that confirm or refute its findings) are valuable, they often focus on one or two sources of information. They can be fraught with their own challenges, such as not being able to account for publication bias, meaning the demonstrated tendency for positive findings to be more favourably reported in scholarly literature than potentially equally valuable but negative (or null) results.
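As a rough illustration of a single-signal metric of this kind, the sketch below assumes the simplest reading of the description above: the share of follow-up studies that confirm a finding. The function name and the exact formula are assumptions, not taken from the interview.

```python
def r_factor(confirming, refuting):
    """Fraction of follow-up studies that confirm a finding.

    Assumed formula: confirming / (confirming + refuting).
    Returns None when no study has tested the finding.
    """
    if confirming < 0 or refuting < 0:
        raise ValueError("counts must be non-negative")
    total = confirming + refuting
    if total == 0:
        return None
    return confirming / total


if __name__ == "__main__":
    print(r_factor(8, 2))  # 8 confirming studies, 2 refuting
```

Because the ratio counts only published studies, unpublished null results never reach the denominator, which is how publication bias can inflate such a score.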

In contrast, DARPA is looking to build upon recent evidence that no single signal is likely to be as informative as many weak signals. For example, forecasting provides various proofs of principle that many weak signals, if identified and correctly integrated, can provide valuable new insights – and, importantly, may be less susceptible to gaming. With so many signals that would have to be spoofed, it might ultimately be easier to just make sure research is reproducible and replicable.

And, as with a credit score, there are lots of weak signals that can now be aggregated using automated techniques that weren’t available even five or ten years ago. Now is the time to push and see if we can build on tools like the r-factor to create something more comprehensive.

Irreproducibility can be caused by many different factors such as statistical mistakes, cutting corners with methodology, lack of standardised reporting of materials, academic fraud and more. Do you think automated tools can account for all these dimensions?

I don’t think we know yet, but we will explore these questions heavily in the program. Any research claim will be somewhere along a spectrum of reproducibility and replicability, for many reasons such as these. But do we have to account for all these things in order to assign accurate confidence scores? Is it possible that some of these signals are more powerful or more informative than others? The most useful signals may also tell us why a result is high or low confidence – such as whether a result is low confidence because signals suggest it is unlikely to replicate, or because it’s so novel. Low confidence could be less a signal that a claim is wrong or misguided, and more that we need additional work in an area.

Academics are often critical of simple solutions to complex problems such as traditional metrics that aim to measure the impact of a researcher or their work. How will DARPA make sure its tools are not seen as another crude approach to a problem that is hard to measure? Why should people have more confidence in automated tools than human judgment?

The plan is not to replace humans with machines, but rather to find the best ways to combine the two. I don’t think people should have greater confidence in automated tools compared to human judgment, until the evidence suggests they should. One point of the program is to explore whether automated tools can give us at least the equivalent of what the best human experts can give us, but at greater speed, scale, and reliability.

We aren’t creating these tools to police the academic community. DARPA doesn’t anticipate that they will be used for tenure decisions or assessing the impact of any given researcher’s work. It’s more geared towards users of social and behavioral science research to help them calibrate the confidence they should have in a particular research claim, which could impact the extent to which they rely on a particular model or how they make a decision. While it’s a bit early at this point to speculate too far, there may be value in having a tool that can help researchers potentially assess the reproducibility and replicability of their own research, and combat entirely understandable biases, which could have positive impacts on the wider research ecology.

Importantly, we plan to make tools that are transparent to users, and we are very explicitly not interested in black-box algorithms that spit out results without explaining how they got them. This is because low confidence in a claim because it’s novel is very different to low confidence due to questionable research practices. And knowing why a claim is low confidence ought to shape how it’s treated, used, or disseminated.

Why focus on social and behavioral sciences? Why not apply the tools to the natural or physical science disciplines as well?

It’s hard to imagine a problem that’s important to national and international security that doesn’t somehow involve understanding human social behavior. As we move into an era of increasingly complex social systems and interactions, leveraging the social and behavioral sciences seems vital. So knowing what kind of confidence one should have in certain research claims could be critical for making progress in solving many of these important problems.

If we are successful, I would like to think our work might be applied to other disciplines, but that would almost certainly require additional research. It’s also worth noting that researchers in psychology and several other social and behavioral science disciplines have been very public about their reproducibility challenges, so we feel like there’s an important discussion going on that can help further motivate this research and engage wider communities.

Do you believe the irreproducibility issue will ever be completely solved?

No, I don’t think irreproducibility will ever not be a feature of research. Science is really difficult and messy. But I do think the research community more generally can make improvements in where, how, and why irreproducibility occurs, and help limit the extent to which it impedes important scientific progress.

SCORE aims to develop tools to help accelerate our progress on that path.