How to translate English preprints into more than 100 languages

This free tool allows users to read preprints in any of the 104 languages recognized by Google Translate.

6 March 2020

Jeffrey M. Perkel

Nicolas Menijes Crego/Alamy Stock Photo

Science may be a global enterprise, but it communicates mostly in one tongue - English. And that can be a problem if you’re not fluent in it.

There is an enormous community of researchers who are not native English speakers, yet they are expected to write their articles in English to a very high standard.

They are also expected to be on top of everything that's been published in English in their subject area, says Humberto Debat, who studies crop diseases at the National Institute of Agricultural Technology in Córdoba, Argentina.

After ruminating on this problem for years, Debat teamed up with Rich Abdill, a computational biology PhD student at the University of Minnesota in Minneapolis, to create a tool that can help.

Humberto Debat

Launched in January, PanLingua allows scientists to search the bioRxiv preprint server - where papers are written entirely in English - in any of the 104 languages recognized by Google Translate.

The tool translates the query into English, uses it to search bioRxiv, translates the results back into the user’s language, and displays them in the browser.
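
The three steps described above can be sketched in a few lines of Python. This is only an illustration of the translate, search, translate-back flow, not PanLingua's actual code: the function names, the toy dictionary standing in for Google Translate, and the in-memory list standing in for bioRxiv are all assumptions made for the example.

```python
from typing import Callable, Dict, List

# Hypothetical interfaces: a translator takes (text, source_lang, target_lang),
# and a searcher takes an English query and returns preprint records.
Translator = Callable[[str, str, str], str]
Searcher = Callable[[str], List[Dict[str, str]]]


def translated_search(
    query: str, user_lang: str, translate: Translator, search: Searcher
) -> List[Dict[str, str]]:
    """Translate the query to English, search, then translate titles back."""
    english_query = translate(query, user_lang, "en")
    results = search(english_query)
    return [
        {**record, "title": translate(record["title"], "en", user_lang)}
        for record in results
    ]


# Toy stand-in for a machine-translation service (assumption, not a real API).
_TOY_DICTIONARY = {
    ("es", "en"): {"virus de plantas": "plant virus"},
    ("en", "es"): {
        "A new plant virus in maize": "Un nuevo virus de plantas en maíz"
    },
}


def toy_translate(text: str, source: str, target: str) -> str:
    # Fall back to the untranslated text when the phrase is unknown.
    return _TOY_DICTIONARY.get((source, target), {}).get(text, text)


# Toy stand-in for an English-only preprint index (assumption).
_TOY_CORPUS = [
    {"title": "A new plant virus in maize", "url": "https://example.org/preprint-1"},
    {"title": "Cell migration in confined spaces", "url": "https://example.org/preprint-2"},
]


def toy_search(english_query: str) -> List[Dict[str, str]]:
    # Simple case-insensitive substring match on the title.
    needle = english_query.lower()
    return [p for p in _TOY_CORPUS if needle in p["title"].lower()]


hits = translated_search("virus de plantas", "es", toy_translate, toy_search)
for hit in hits:
    print(hit["title"], hit["url"])
```

Passing the translator and the search function in as arguments keeps the flow independent of any particular translation service or preprint server, which is the part of the design that would need to change to support other archives.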

A nod to a universal language

PanLingua came to life at a conference in San Francisco, where Debat gave a short presentation on his vision for a system that could search preprints in any language. In the audience, Abdill realized he was in a position to help make that happen. He coded up a prototype and showed Debat the next day.

Abdill is the developer behind rxivist.org, a separate metadata search tool that allows users to search bioRxiv preprints based not only on keyword and author information, but also on how popular they are on Twitter or how many times they have been downloaded.

Rich Abdill

“[PanLingua] was a project that was totally conceived and designed by Humberto,” Abdill explains. “He said, ‘I have this project and I've got everything except the code’, and here I am sitting in the room being like, ‘All I have is code - I can fill in that gap for you’.”

The pair named it after a universal language conceived by the Argentinian artist Xul Solar.

Many web browsers already offer automatic translation of foreign-language pages, including bioRxiv when it is reached from non-English-speaking countries. What PanLingua provides, Debat says, is the ability to search for articles in languages other than English.

“The problem that PanLingua intends to solve is that users may not reach those articles nor know they exist.”

Already in use

Since its launch in mid-January, PanLingua has had about 800 visitors. “That's far less than we're guessing the number of people that might be able to make use of this,” says Abdill.

One of those visitors is Daniel Prieto, a developmental cell biology postdoc at the Clemente Estable Institute of Biological Investigations in Montevideo, Uruguay, who can both read and write in English. He says he uses PanLingua for about 30% of his work on bioRxiv, at least partly because he wants to support the project.

But it’s also easier to work in, he says: “It gives me the impression that things get more familiar,” by which he means it feels more natural or comfortable.

And the translations are “quite good”, says Prieto, based on his experience with his own preprints. “The phrasing is not the same, but it gets the general idea,” he says.

Pablo J. Sáez, an English-speaking Chilean cell biologist now working at the Institut Curie in Paris, agrees. Sáez is fluent in Spanish, Portuguese, Italian, and French, and translations in all those languages were “quite accurate”, he says.

As the artificial intelligence that drives Google Translate improves, so too will the translations on PanLingua. But a translation doesn’t have to be perfect so long as it is good enough to get the idea of the research across, Abdill says.

“'Close enough' is not an idea that's practical for a novel or a poem, but if you're trying to figure out, ‘Is this a paper I need to pay more attention to?’, we can get the gist from a machine translation,” he explains.

Resurfacing lost knowledge

Though most researchers at his institute can speak and write in English, says Debat, PanLingua could prove useful to those people on staff who perform outreach with the broader community, many of whom are more comfortable working in Spanish.

Sáez says the tool could aid university undergraduates who have not yet completed their training and thus do not read English “with a high level of comprehension”, as well as those who perform scientific outreach to children and high-school students.

For the moment, PanLingua works only with bioRxiv - a function, Abdill says, of the way sites such as PubMed and other preprint servers render their pages.

But Debat hopes it may ultimately be possible to apply this type of approach to other databases and archives, and even to archives of papers not written in English.

One 2018 letter to Nature estimated that some 79 million articles have been published in Chinese since 1979, making their findings invisible to that portion of the scientific community who cannot read them.

The opportunity cost of that lost knowledge is incalculable, precisely because the researchers who cannot read that literature do not know what they are missing.

The current coronavirus outbreak provides a case in point, says Prieto.

“Chinese colleagues have made a lot of effort to get things published in English to keep the world up-to-date,” he says.

“But what would have happened with this coronavirus knowledge if [the virus] hadn't provoked this kind of outbreak? It would probably have been published in Chinese. And there would be a corpus of a hundred papers about this virus that nobody would notice.”

Jeffrey M. Perkel is a Technology Editor at Nature.
