Taking the pain out of data sharing, 2022 Matthew Hutson

Sly Saint · Oct 7, 2022

Despite agreeing to make raw data available, some authors fail to comply. The right strategies and platforms can ease the task.

Journals and funding bodies increasingly require manuscript authors to share data on request or make the information publicly available. It’s a big ask from a technical standpoint, but some straightforward strategies can simplify the process.

Scientific papers rarely include all the data used to justify the conclusions, even in the supplementary material. Authors might fear getting scooped, or that other researchers will use the raw data to make fresh discoveries, or they might wish to protect the privacy of study participants. Or, more probably, authors have neither the time nor the expertise to package the data for others to view and understand.

Such reticence costs the research community. Data transparency allows others to repeat analyses and catch mistakes or fraudulent claims. It allows for new findings through the reanalysis of existing data sets, and it increases trust in the scientific process. In August, the White House Office of Science and Technology Policy announced that, by 2025, scientific data from all new federally funded research must be made accessible to the US public. And when submitting papers, authors are increasingly required to provide raw data to editors, to place data online or to include data-sharing statements as to whether they will offer data on request. Unfortunately, such policies are not bulletproof, as the largest study of its kind starkly documents.

Tom Jefferson, an epidemiologist at the University of Oxford, UK, says authors should face consequences for making false data-availability statements. “The editors should take action, whether it’s a correction or retraction,” he says, adding that the excuse of no longer having the data to hand is like saying “the cat ate my filing cabinet”.

Valentin Danchev, a computational social scientist at Queen Mary University of London, calls the study a useful step towards understanding the actual state of data sharing. But, he adds, “we need more of those studies so that we can generalize across different areas and different survey designs”.

Last year, Danchev co-authored a study [2] of 487 clinical trials that were published in JAMA, The Lancet or The New England Journal of Medicine. The authors of 89 of these articles said they’d stored data sets in online repositories, but Danchev’s team could find only 17 in the designated locations.

full article
https://www.nature.com/articles/d41586-022-03133-5

rvallee · Oct 7, 2022

"It’s a big ask from a technical standpoint, but some straightforward strategies can simplify the process."

No, it's not. Not even a bit. This is my expertise. 50 years ago? Yeah, annoying and technically fraught with time-consuming labor. Today? It's trivial. Zero difficulty here. The difficulty here is entirely human, political.

"Authors might fear getting scooped, or that other researchers will use the raw data to make fresh discoveries, or they might wish to protect the privacy of study participants. Or, more probably, authors have neither the time nor the expertise to package the data for others to view and understand."

See? Zero technical difficulties mentioned here. And the second-to-last one is trivial if done correctly: there is so much publicly available data out there that easily meets those standards. People figured out how to do this a long time ago.

And I have to separate the last one because it's absurd. Experts do not have the expertise to package data for others to view and understand? Then they wouldn't be experts. It's not plausible to have the expertise to produce scientific data but somehow be unable to make the data available. This is not even a plausible excuse.

This is mostly political, and to some extent cultural. Medicine is used to operating in complete secrecy, and it's obviously a terrible way of doing things. This is more like zero-tolerance policies that somehow can have lots of tolerance depending on who the authors are. All of this is a choice.

Sean · Oct 8, 2022

The only tricky bit in making data available is ensuring it is adequately anonymised (where required, e.g. clinical trials). The rest, as @rvallee points out, is a trivial exercise in the modern online world.
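To make the "tricky bit" concrete, here is a minimal Python sketch of one common first step: dropping direct identifiers and replacing the participant ID with a keyed hash. The column names, the `pseudonymise` helper and the key are all hypothetical, and this is pseudonymisation rather than full anonymisation; a real clinical data set would still need a proper disclosure-risk review (rare values, dates, free text and so on).

```python
import hashlib
import hmac

# Hypothetical column names; a real trial data set will differ.
DIRECT_IDENTIFIERS = {"name", "email", "date_of_birth"}


def pseudonymise(rows, secret_key, id_field="participant_id"):
    """Drop direct identifiers and replace the ID with a keyed hash.

    The same ID always maps to the same pseudonym, so records can still
    be linked, but the original ID cannot be recomputed without the key.
    """
    cleaned = []
    for row in rows:
        # Keep every column that is not a direct identifier.
        kept = {k: v for k, v in row.items() if k not in DIRECT_IDENTIFIERS}
        digest = hmac.new(secret_key, str(row[id_field]).encode("utf-8"), hashlib.sha256)
        kept[id_field] = digest.hexdigest()[:16]
        cleaned.append(kept)
    return cleaned


# Made-up records, purely for illustration.
rows = [
    {"participant_id": 1, "name": "A. Example", "email": "a@example.org",
     "date_of_birth": "1980-01-01", "score": 42},
    {"participant_id": 2, "name": "B. Example", "email": "b@example.org",
     "date_of_birth": "1975-06-30", "score": 37},
]
shared = pseudonymise(rows, secret_key=b"keep-this-key-private")
```

The key stays with the study team; only `shared` would be deposited.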

I am so tired of the excuses for not making data publicly available for all to access. As far as I am concerned, any science paper that doesn't do so simply should not get published, or should get retracted if it is already published (e.g. PACE).

It should be one of the most basic and non-negotiable rules for being allowed to play the serious science game.

Arnie Pye · Oct 8, 2022

rvallee said:
Medicine is used to operating in complete secrecy

I remember hearing a story some years ago, about a GP who started ranting and screaming at a patient because the patient had asked for a copy of their GP records. The GP was apparently pissed off because they were "his" records not the patient's, and they were his because he had written them. The GP considered the records to be none of the patient's business.

Edit: Oops, just realised this is not really on-topic.

Haveyoutriedyoga · Oct 8, 2022

Pretty sure the FDA just made this mandatory, for at least a certain category of trials (sorry can't remember the source)

*in the US, obviously

cassava7 · Oct 8, 2022

rvallee said:
No, it's not. Not even a bit. This is my expertise. 50 years ago? Yeah, annoying and technically fraught with time-consuming labor. Today? It's trivial. Zero difficulty here. The difficulty here is entirely human, political.

See? Zero technical difficulties mentioned here. And the second-to-last one is trivial if done correctly: there is so much publicly available data out there that easily meets those standards. People figured out how to do this a long time ago.

And I have to separate the last one because it's absurd. Experts do not have the expertise to package data for others to view and understand? Then they wouldn't be experts. It's not plausible to have the expertise to produce scientific data but somehow be unable to make the data available. This is not even a plausible excuse.

This is mostly political, and to some extent cultural. Medicine is used to operating in complete secrecy, and it's obviously a terrible way of doing things. This is more like zero-tolerance policies that somehow can have lots of tolerance depending on who the authors are. All of this is a choice.

Sharing study data entails more than uploading it to a Git repository, though, and even that is a skill that needs training and practice, especially for people who do not have a background in computer science.

Studies where large datasets are generated now almost always involve statisticians who should have the necessary technical skills. For smaller studies, though, that is not the case, and the issue is that MDs, biologists, psychologists, sociologists and so forth don’t receive training on this during their education. I concede that many work with spreadsheets (e.g. Excel), which are readily shareable, but this does not cover all cases.
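As a rough illustration of what "packaging" a small spreadsheet-style data set could involve beyond the upload itself, here is a minimal Python sketch that writes the data as CSV alongside a manifest holding a description of each column and a checksum. The `package_dataset` function, the file names and the example columns are assumptions made for this sketch, not an established standard or any repository's required format.

```python
import csv
import hashlib
import json
import tempfile
from pathlib import Path


def package_dataset(rows, descriptions, out_dir):
    """Write data.csv plus manifest.json (column descriptions + SHA-256)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # The description keys define the column order of the CSV file.
    data_path = out_dir / "data.csv"
    with data_path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=list(descriptions))
        writer.writeheader()
        writer.writerows(rows)

    manifest = {
        "file": data_path.name,
        "rows": len(rows),
        # The checksum lets others verify they downloaded the same file.
        "sha256": hashlib.sha256(data_path.read_bytes()).hexdigest(),
        "columns": descriptions,
    }
    manifest_path = out_dir / "manifest.json"
    manifest_path.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return manifest


# Made-up data, purely for illustration.
example_descriptions = {
    "participant": "Pseudonymous participant code",
    "arm": "Treatment arm (A or B)",
    "score": "Outcome score at week 12 (0-100)",
}
example_rows = [
    {"participant": "p01", "arm": "A", "score": 42},
    {"participant": "p02", "arm": "B", "score": 37},
]
example_dir = Path(tempfile.mkdtemp())
example_manifest = package_dataset(example_rows, example_descriptions, example_dir)
```

The point of the column descriptions is the "view and understand" part: a bare spreadsheet is shareable, but without a note on what each column means and in which units, it is hard for anyone else to reuse.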

Ideally, a relevant course should be added to all university curriculums that teach applied statistics, but that is more wishful thinking than anything else and does not solve the problem for the current and older generations of researchers. The author is absolutely right in saying that tools and frameworks to facilitate data sharing should be created now and widely deployed.

This, of course, requires both funding and manpower (researchers), but thankfully it seems that more and more scientists have become aware of research integrity issues over the past decade and have expressed interest in solving them, so hopefully it will become a reality sooner rather than later.

Independently of the above considerations on technical skills, authors refusing to share data after stating that they will make it available on request is an unacceptable practice that journals should be addressing much more diligently. As one of the interviewees suggests, they should, at the very least, issue an expression of concern when they are provided with proof of refusal or of no reply from the authors.

We can hope that if frameworks are developed and become widely adopted, publishers will be pushed into enforcing their use. Then, data would be available directly from journals rather than the authors, as supplementary material currently is. This, though, should already be current practice when the authors do agree to making their data available.

rvallee · Oct 8, 2022

Haveyoutriedyoga said:
Pretty sure the FDA just made this mandatory, for at least a certain category of trials (sorry can't remember the source)

*in the US, obviously

Not sure if it was the FDA, but yes, this is a recent change: anyone living in the US now has full access to their medical records without restrictions or conditions. This is very recent, as in recent weeks.

It's technically supposed to be friction-free, no need to justify anything.

Edit: literally this week, and in digital format: https://www.statnews.com/2022/10/06/health-data-information-blocking-records/, and it's the law, not just an administrative rule: "The new federal rules — passed under the 21st Century Cures Act"
