Recently, patients with an interest in bioinformatics have shared their own analyses of genetic data relating to ME/CFS. Their findings have been met with interest on the forum, and I think they support the case that researchers outside of a formal project structure can contribute something useful.
These analyses have so far been limited to summary statistics that are publicly available. There is only so much you can do with summary statistics. I started this thread to discuss whether it makes sense for certain studies to provide an interface to the source data.
Access to source data
DecodeME's source data is not public, and for good reason.
The challenge with genetic data, unlike many other datasets, is that it cannot be truly anonymised: even with names stripped, individuals can be re-identified by cross-referencing publicly available genealogy databases.
"DecodeME's anonymised and consented data are only shared with studies that meet high standards and whose academic or industrial researchers agree to treat its data with respect and to keep it secure."
Federated analysis
One solution is federated analysis, where the data remains with the owner and analysis code is sent by a user to the data. Only resulting summary information is accessible, not individual data rows.
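Federated analysis is not trivial to implement; there is a trade-off between security and expressiveness: the more freedom users have in what they can run, the harder it is to guarantee that no individual-level data leaks out. I think DataSHIELD (University of Liverpool) strikes a good balance for this application.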
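To make the idea concrete, here is a minimal sketch in Python of the principle described above. It is not how DataSHIELD works internally, and everything in it is hypothetical: the `DataOwner` class, the `MIN_GROUP_SIZE` threshold and the toy rows are made up for illustration. The point is only that the user sends a request to the data, the owner allows a small whitelist of aggregate operations, and results for very small groups are withheld.

```python
from statistics import mean

# Hypothetical disclosure threshold: results for groups smaller
# than this are withheld (an assumption, not a DataSHIELD setting).
MIN_GROUP_SIZE = 5


class DataOwner:
    """Holds individual-level rows and only ever returns aggregates."""

    # Whitelist of aggregate operations a remote user may request.
    ALLOWED = {"count": len, "mean": mean}

    def __init__(self, rows):
        self._rows = rows  # individual rows stay with the owner

    def run(self, operation, column, where=None):
        """Apply a whitelisted aggregate to `column`, optionally
        restricted to rows matching the equality filters in `where`."""
        if operation not in self.ALLOWED:
            raise PermissionError(f"operation {operation!r} is not allowed")
        where = where or {}
        selected = [
            row for row in self._rows
            if all(row[key] == value for key, value in where.items())
        ]
        if len(selected) < MIN_GROUP_SIZE:
            raise PermissionError("group too small to disclose")
        return self.ALLOWED[operation]([row[column] for row in selected])


# Toy data, invented for this example only.
rows = [{"case": i % 2 == 0, "age": 30 + i} for i in range(12)]
owner = DataOwner(rows)

# The user only ever sees the summary value, never the rows.
mean_age_cases = owner.run("mean", "age", where={"case": True})
n_cases = owner.run("count", "age", where={"case": True})
```

Restricting requests to named operations and simple filters is the "security" end of the trade-off: it is easy to reason about, but far less expressive than letting users send arbitrary analysis code.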
Some questions
- Would the number of interested community users warrant setting up and maintaining such a system?
- Would the added value from community researchers be significant? Or would most insights already be extracted from the data by the planned analysis activities?
- Would this primarily attract independent researchers, or would 'organised' research teams benefit too?
Some reasons this might not be a good idea
- Considerable effort to set up and maintain
- Security incidents could jeopardise trust
- Data might be misused, or claims "based on DecodeME data" might be made from unsound statistical analysis