Sharing genetic research data

wallfish

Recently, patients with an interest in bioinformatics have shared their own analyses of genetic data relating to ME/CFS. Their findings have been met with interest on the forum, and I think they support the case that researchers outside of a formal project structure can contribute something useful.

These analyses have so far been limited to summary statistics that are publicly available. There is only so much you can do with summary statistics. I started this thread to discuss whether it makes sense for certain studies to provide an interface to the source data.

Access to source data
DecodeME's source data is not public, and for good reason.
"DecodeME's anonymised and consented data are only shared with studies that meet high standards and whose academic or industrial researchers agree to treat its data with respect and to keep it secure."
The challenge with genetic data, unlike many other datasets, is that it cannot be truly anonymised: even with names stripped, individuals can often be re-identified by matching their genotypes against publicly available genealogy databases.

Federated analysis
One solution is federated analysis, where the data remains with the owner and the user sends analysis code to the data. The user can access only the resulting summary information, not individual data rows.

Federated analysis is not trivial to implement; there is a tradeoff between security and expressiveness. I think DataSHIELD (University of Liverpool) strikes a good balance for this application.
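
To make the idea concrete, here is a minimal Python sketch of the principle. It is not DataSHIELD's API (DataSHIELD is used from R); the names DataOwner, summarise and MIN_GROUP_SIZE, the threshold of 5 and the toy rows are all assumptions made up for illustration. The owner keeps the rows, accepts only a restricted kind of request (a column plus equality filters), and refuses to answer when the selected group is small enough to risk exposing individuals.

```python
from statistics import mean

# Hypothetical disclosure threshold: refuse summaries over very small groups.
MIN_GROUP_SIZE = 5


class DataOwner:
    """Holds individual-level rows and only ever releases aggregates."""

    def __init__(self, rows):
        self._rows = list(rows)  # never returned to the caller

    def summarise(self, column, filters=None):
        """Return count and mean of `column` for rows matching `filters`.

        `filters` maps column names to required values. Restricting requests
        to this simple form is the 'expressiveness' side of the tradeoff:
        the user cannot send arbitrary code.
        """
        filters = filters or {}
        selected = [
            row[column]
            for row in self._rows
            if all(row.get(key) == value for key, value in filters.items())
        ]
        if len(selected) < MIN_GROUP_SIZE:
            raise PermissionError("group too small to release a summary")
        return {"n": len(selected), "mean": mean(selected)}


# Toy data, invented for this example (not real study data).
rows = [{"group": "case", "allele_count": c} for c in (0, 1, 1, 2, 2, 2)]
rows += [{"group": "control", "allele_count": c} for c in (0, 0, 1)]
owner = DataOwner(rows)

# The user sees only the aggregate for the larger group.
print(owner.summarise("allele_count", {"group": "case"}))

# A request over a group below the threshold is refused.
try:
    owner.summarise("allele_count", {"group": "control"})
except PermissionError as err:
    print("refused:", err)
```

A real system needs much more than this (for example, protection against combining many overlapping queries to isolate one person), which is part of why setting one up is not trivial.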

Some questions
  • Would the number of interested community users warrant setting up and maintaining such a system?
  • Would the added value from community researchers be significant? Or would most insights already be extracted from the data by the planned analysis activities?
  • Would this primarily attract independent researchers, or would 'organised' research teams benefit too?
Edit: from my post below: If you have an opinion on how these questions would apply to DecodeME, I'm interested to hear it!

Some reasons this might not be a good idea
  • Considerable effort to set up and maintain
  • Security incidents could jeopardise trust
  • Data might be abused, or claims made "based on DecodeME data" from unsound statistical analysis
There are probably more good reasons against this.
 
For federated analyses, there are also the costs of running the analyses and storing the output files, which would be much higher than the cost of just hosting the input/source data.

Another problem is that things go wrong all the time. There might be a problem with the storage server or computer cluster, so your pipeline fails. Or you end up with corrupted files, which might be your final output or an input to a later step in the pipeline. Or your pipeline fails because you couldn't predict some edge case or made a mistake scheduling jobs. Who would fix that, and how efficiently? If the pipeline is well structured, the code is well written and you control the runs yourself, you can fix these things as they happen and move on.

Running pipelines can be very expensive. They might be running for weeks or months, and some people I know were running a single pipeline on their dataset for almost a year.
"Would the added value from community researchers be significant? Or would most insights already be extracted from the data by the planned analysis activities?"
Generally speaking, I guess it depends on the data and project.

Even if a team and their collaborators were able to do everything possible at the moment, there will be new biological insights and methodological developments that could use the existing data for new analyses.
"Would this primarily attract independent researchers, or would 'organised' research teams benefit too?"
Research teams typically have access to computing resources at their or collaborators' institutions and/or pay for cloud computing.
 
Thanks for sharing your thoughts, Felis Catus.
"Another problem is that things go wrong all the time. [...] Who would be fixing that? How efficient would that be? If the pipeline is well structured, the code well written and you have control over running it, you could be fixing these things as they happen and move on."
Users should test their pipelines locally on dummy data. Of course, stuff can still go wrong. Some federated systems require human approval of all code, which would make iterations slow. That's why I mentioned a platform like DataSHIELD, which allows you to execute one command at a time, so users can explore and fix their own problems. (I've only read some of their docs; I haven't used the platform.)
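
As a rough illustration of testing on dummy data, here is a small Python sketch that generates a synthetic cohort and runs one analysis step on it. The record layout (id, is_case, genotypes coded 0/1/2) and the function names are assumptions for the example; the real study's data format may look quite different, and random genotypes like these carry no biological signal.

```python
import random


def make_dummy_cohort(n_people, n_variants, seed=0):
    """Build a reproducible synthetic cohort with random genotypes.

    Each genotype is the number of alternate alleles (0, 1 or 2).
    The layout is hypothetical and only meant for local pipeline tests.
    """
    rng = random.Random(seed)  # fixed seed so failures can be reproduced
    cohort = []
    for i in range(n_people):
        cohort.append({
            "id": f"dummy{i:04d}",
            "is_case": rng.random() < 0.5,
            "genotypes": [rng.choice((0, 1, 2)) for _ in range(n_variants)],
        })
    return cohort


def allele_frequency(cohort, variant_index):
    """Alternate-allele frequency at one variant: alleles seen / alleles possible."""
    total = sum(person["genotypes"][variant_index] for person in cohort)
    return total / (2 * len(cohort))


# Run the analysis step locally before sending anything to the real data.
cohort = make_dummy_cohort(n_people=100, n_variants=5, seed=1)
print(allele_frequency(cohort, 0))
```

Dummy data like this catches crashes and format mistakes early, but not problems that only show up at full scale or with real edge cases.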
"Running pipelines can be very expensive. They might be running for weeks or months, and some people I know were running a single pipeline on their dataset for almost a year."
I am aware of high computation times in other domains, but I have little experience with bioinformatics. Are the analyses we're likely to see in genomic studies of that order of magnitude? I would expect that a lot of classical ML can be done in reasonable time. Users can likely be billed for their compute/storage at a similar price to what they would have to pay elsewhere.
"Generally speaking, I guess it depends on the data and project."
If you or anyone else reading this has an opinion on how these questions would apply to DecodeME, I'm interested to hear it!
"Even if a team and their collaborators were able to do everything possible at the moment, there will be new biological insights and methodological developments that could use the existing data for new analyses."
I agree, that's one argument in favour.
"Research teams typically have access to computing resources at their or collaborators' institutions and/or pay for cloud computing."
Yes. The potential benefit wouldn't come from compute, but from easier access to data. That benefit applies more to independent researchers or interested patients than to teams, who are in a position to receive the source data from the owners directly, at least in the case of DecodeME. Some groups might not want to share data for legal or commercial reasons, and in that case federated access is also a solution, so that is one benefit for teams after all. On the other hand, teams might be more likely to do advanced analyses for which the federated interface is not expressive enough.

I would love to play around with the DecodeME data, but maybe that's just unrealistic without joining a research group. Maybe I'm spoiled from all the data that are freely available in other domains (that are easier to anonymise).
 