Machine Learning-assisted Research on ME/CFS

Thanks. If I understand correctly, though, your network analysis is based on mentions in abstracts and the text of a publication and not on the actual data of studies?
First of all, thank you for the reply. Much appreciated.

What you say is correct. All of the work was an attempt to connect the medical concepts that appear in the text of the abstracts. This was the first part of the work, the network analysis (which was later used by Wenzhong Xiao).

The second part of the work was taking the various symptoms of ME/CFS, retrieving abstracts related to each of these symptoms, and then asking a machine learning model to tell us which combination of topics could predict a symptom vs a non-symptom state.
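To give a rough idea of what that step looks like, here is a minimal toy sketch of symptom vs non-symptom text classification (made-up abstracts and terms, not my actual pipeline or data):

```python
# Minimal, hypothetical sketch: label abstracts as belonging to a symptom
# corpus (1) or a non-symptom corpus (0), turn them into bag-of-words
# features, and ask a linear model which terms separate the two classes.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy placeholder corpora; in practice these would be thousands of PubMed abstracts.
abstracts = [
    "orthostatic intolerance and autonomic dysfunction in patients",   # symptom corpus
    "reduced cerebral blood flow during tilt table testing",           # symptom corpus
    "heart rate variability and baroreflex sensitivity changes",       # symptom corpus
    "randomised trial of statin therapy for hypercholesterolaemia",    # non-symptom corpus
    "surgical outcomes after knee arthroplasty in older adults",       # non-symptom corpus
    "antibiotic resistance patterns in hospital-acquired infections",  # non-symptom corpus
]
labels = [1, 1, 1, 0, 0, 0]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(abstracts)

clf = LogisticRegression(max_iter=1000).fit(X, labels)

# Terms with the largest positive weights are the ones most predictive of the
# symptom class; in the real setting these would be medical concepts/topics.
terms = vectorizer.get_feature_names_out()
for weight, term in sorted(zip(clf.coef_[0], terms), reverse=True)[:10]:
    print(f"{term}: {weight:.3f}")
```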


I do not quite understand what you mean by the "actual data" of studies, so I would appreciate your time in explaining what this means. But for the moment let's look at the results so far, given the input I described.

The key question here is: can the above appear by pure chance or not? What do we need in order to claim that the above methodology indeed works? From what I understand, a key part is to determine the number of plausible topics for ME/CFS and then run an appropriate analysis. I find it extremely hard to see how this can be done reliably, but I am open to any suggestions. I can provide a full list of the concepts I identified on my side.

If this is cherry picking, then I see no harm done apart from wasting 15 years of my life trying to convince others, but at least I can now function and have a near-normal life. So harm to my time (and income), but no harm to patients' health and wallets.

But if something is indeed taking place here, then we are talking about consistent negative bias over the years. Can patients "afford" not to look at this methodology? @forestglip @Hutan I would also appreciate your input.
 
I do not quite understand what you mean by the "actual data" of studies, so I would appreciate your time in explaining what this means.
The raw data of the experiments, for example, in CSV format.

A big problem I see is that almost all studies misrepresent their data in the abstract and text (to make it look like a bigger deal than it is or to promote the authors' favoured theory).

So I think machine learning/AI/network analysis, etc. will only be useful if we skip how authors present their findings and only train it on the actual data, with strong selection criteria so that only high-quality experiments, such as, e.g., DecodeME, are included.
 
But if something is indeed taking place here, then we are talking about consistent negative bias over the years. Can patients "afford" not to look at this methodology?
Looking at the list above, I do wonder about matching findings by chance. How many genes did your program identify, and how many genes did the latest PrecisionLife study identify? Didn't the latter report a few hundred or a few thousand genes? That seems like an opportunity for a few genes to overlap by chance.

I think it would be important to clearly outline what exactly the methodology is. I only have a vague idea of what is being done.

I'm sorry, I don't otherwise have the energy to really try to understand this, and I'm not sure I'd be able to follow a detailed description of the entire pipeline anyway, but it might be good to have one for demonstrating to someone that there's some promise to this tool. It's too hard for anyone new to these ideas to discern what is being argued from 10 pages of posts in this thread, and various assorted posts on other threads and websites.
 
The raw data of the experiments, for example, in CSV format.

A big problem I see is that almost all studies misrepresent their data in the abstract and text (to make it look like a bigger deal than it is or to promote the authors' favoured theory).

So I think machine learning/AI/network analysis, etc. will only be useful if we skip how authors present their findings and only train it on the actual data, with strong selection criteria so that only high-quality experiments, such as, e.g., DecodeME, are included.


I see, so my understanding is that you are referring to Garbage In, Garbage Out (GIGO): if the input data is garbage, then the output is garbage.

The thing is that in my analysis I did not look at the results of the studies. The search space of the analysis was built by identifying which concepts appeared together in research papers, so a co-occurrence analysis was at the heart of it. I wanted to understand the connections between symptoms and various medical concepts.
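To make that concrete, here is a toy sketch of concept co-occurrence counting (made-up concepts and a tiny input, not the actual analysis or vocabulary):

```python
# Minimal, illustrative sketch of concept co-occurrence counting over abstracts.
from itertools import combinations
from collections import Counter

# Hypothetical input: each abstract reduced to the set of medical concepts
# detected in its text (e.g. via a dictionary or named-entity recogniser).
abstract_concepts = [
    {"orthostatic intolerance", "cerebral blood flow", "autonomic dysfunction"},
    {"orthostatic intolerance", "heart rate variability"},
    {"cerebral blood flow", "autonomic dysfunction", "heart rate variability"},
]

cooccurrence = Counter()
for concepts in abstract_concepts:
    for a, b in combinations(sorted(concepts), 2):
        cooccurrence[(a, b)] += 1

# Pairs that appear together in many abstracts become strong edges in the
# concept network; rare pairs get weak edges (or are dropped by a threshold).
for (a, b), count in cooccurrence.most_common():
    print(f"{a} -- {b}: {count}")
```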

If this is so, does this change your GIGO belief?

@forestglip, you said:

Looking at the list above, I do wonder about matching findings by chance

Wondering is one thing, but we cannot dismiss a hypothesis, especially one with repeated confirmations, that easily, correct? My message above was about how we can identify whether the computational techniques I used did in fact do better than pure chance. What search space should I use? The ~19,000 human genes? How many pathways? How many symptoms?

In any case, thank you both for your time.
 
Wondering is one thing, but we cannot dismiss a hypothesis, especially one with repeated confirmations, that easily, correct? My message above was about how we can identify whether the computational techniques I used did in fact do better than pure chance. What search space should I use? The ~19,000 human genes? How many pathways? How many symptoms?
One option is using a hypergeometric test to check how likely it is to get an overlap as large as yours by chance. It's described here, with a built-in calculator: https://statisticsbyjim.com/probability/hypergeometric-distribution/ (That calculator doesn't seem to work when the sample size is too large, but this one does: https://sebhastian.com/hypergeometric-distribution-calculator/)

So the problem could be described as: there are ~20,000 genes, which would be the "population size" of genes (N). PrecisionLife identified 259 genes (or did you compare your results to the larger list of 2311 genes?), which we can consider the number of "true events" in the population (K).

Your pipeline came up with some number of genes (i.e. you selected a sample from the population of 20,000). Let's say your pipeline selected 300 genes from the full population of 20,000. This is the sample size (n). And let's say 4 of them were "true events" (they overlap with the 259 genes from PrecisionLife). We want to know how likely it is to get 4 or more successes in your sample, if we assume your sample of genes was chosen randomly.

Using the numbers above in the calculator (just example numbers, since I don't know how many genes your pipeline produced or whether you used the larger or smaller PrecisionLife set): N=20000, K=259, n=300, k=4. If these numbers were correct, then you'd get P(X ≥ 4) = 0.54624, i.e. about a 50% chance of getting 4 or more PrecisionLife genes in your pipeline assuming your pipeline was just selecting 300 genes at random, and thus you would not reject the null hypothesis.
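If you'd rather compute this locally than rely on the online calculators, the same example can be run with scipy's hypergeometric distribution (the same illustrative numbers as above, not real results):

```python
# Hypergeometric tail probability for the worked example above (illustrative numbers).
from scipy.stats import hypergeom

N_population = 20000  # approximate number of human protein-coding genes
K_true = 259          # PrecisionLife genes ("true events" in the population)
n_sample = 300        # genes produced by the pipeline (hypothetical)
k_overlap = 4         # genes appearing in both lists (hypothetical)

# P(X >= k) is the survival function evaluated at k - 1.
p_value = hypergeom.sf(k_overlap - 1, N_population, K_true, n_sample)
print(f"P(X >= {k_overlap}) = {p_value:.5f}")  # ~0.546 with these numbers
```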
 
The thing is that in my analysis I did not look at the results of the studies. The search space of the analysis was built by identifying which concepts appeared together in research papers, so a co-occurrence analysis was at the heart of it. I wanted to understand the connections between symptoms and various medical concepts.

If this is so, does this change your GIGO belief?
Not really, because the co-occurrence of concepts is probably based on the text of the manuscript, not on the actual data.

Suppose that several studies found a slight increase in metabolite X but do not mention it anywhere in their paper; it's only visible in the supplementary data. Would the algorithm be able to pick this up?

Conversely, suppose several papers highlight immune cell Y in their abstract as being related to ME/CFS because this is a hot topic. But their underlying data do not support these statements; the results are described misleadingly to increase publication chances. Would the algorithm be able to see through this?

How does it deal, for example, with findings such as XMRV, which comes up in ME/CFS papers multiple times but turned out to be a false finding best to be ignored?
 
Not really, because the co-occurrence of concepts is probably based on the text of the manuscript, not on the actual data.

Suppose that several studies found a slight increase in metabolite X but do not mention it anywhere in their paper; it's only visible in the supplementary data. Would the algorithm be able to pick this up?

No, the algorithm would not pick this up.
Conversely, suppose several papers highlight immune cell Y in their abstract as being related to ME/CFS because this is a hot topic. But their underlying data do not support these statements; the results are described misleadingly to increase publication chances. Would the algorithm be able to see through this?

Again, no, the algorithm would not be able to see through this.
How does it deal, for example, with findings such as XMRV, which comes up in ME/CFS papers multiple times but turned out to be a false finding best to be ignored?

All of your concerns are valid, and there are actually many other issues; for example, the Liver X Receptor may appear in the literature as LXR, NR1H3, or NR1H2, and such discrepancies introduce further problems. However, you have to look at this problem in terms of the measures used in Information Retrieval, specifically Precision, Recall, and the F-score:

https://en.wikipedia.org/wiki/Evaluation_measures_(information_retrieval)
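As a toy illustration of these measures (made-up sets, not real results from my analysis):

```python
# Precision, recall and F1 for a set of retrieved concepts compared against a
# manually verified reference set (toy data for illustration only).
retrieved = {"LXR", "NR1H3", "TNF", "IL6", "XMRV"}
relevant  = {"LXR", "NR1H3", "NR1H2", "TNF"}

true_positives = retrieved & relevant
precision = len(true_positives) / len(retrieved)   # 3/5
recall = len(true_positives) / len(relevant)       # 3/4
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
# Note: because LXR, NR1H3 and NR1H2 name the same receptor, synonym
# normalisation would change both precision and recall here.
```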

Now, the first example is related to having high precision. Remember, I wanted to build a map of associations between medical concepts; supplementary data are irrelevant to that kind of information. Your XMRV example is spot on, and it does indeed introduce bias into the co-occurrence analysis. However:

1) Since millions of abstracts were analysed, it is logical to say that such bias is minimised, correct? Well-known associations will have a high co-occurrence count, while the ones you mentioned will have a very low one (see the sketch after this list).

2) The abstracts I analysed were not only related to ME/CFS; they were related to the symptoms of ME/CFS, so any ME/CFS-related concept bias is expected to be low.
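As a toy illustration of point 1 (made-up counts, not real data), rare and likely spurious co-occurrences can simply be dropped by a minimum-support threshold:

```python
# Illustrative only: with a large corpus, spurious pairs (e.g. a retracted
# finding co-occurring with a symptom) tend to have low counts and are removed
# by a minimum-support threshold. Counts below are invented for the sketch.
cooccurrence_counts = {
    ("fatigue", "autonomic dysfunction"): 1842,
    ("fatigue", "cerebral blood flow"): 957,
    ("fatigue", "XMRV"): 12,
}

MIN_SUPPORT = 50  # arbitrary threshold chosen for the example
filtered = {pair: n for pair, n in cooccurrence_counts.items() if n >= MIN_SUPPORT}
print(filtered)  # the rare, likely spurious edge is gone
```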

For the record, someone else suggested the following way to validate the methodology I used:

As for your method, to validate it, you would need a bench test where you calculate the sensitivity and specificity of the predictions. You should also have random predictions to compare your method with.

You can use diseases with a consolidated list of genes (diabetes, schizophrenia, etc.), and you could see whether your method, using only literature from before, let's say, 2015, predicts what has been discovered by standard GWAS performed after that year. If you only use predicted genes, rather than pathways, it should be much easier.

Once you have demonstrated that this method makes predictions on certain diseases that are significantly better than random predictions, you can apply it with confidence to other diseases.
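In code, that kind of benchmark could look roughly like the following toy sketch (placeholder gene lists and random draws only, not real predictions or GWAS results): take the genes the method predicts from pre-2015 literature, count how many are confirmed by post-2015 GWAS, and compare against random gene picks of the same size.

```python
# Toy sketch of a random-baseline validation (all data below is placeholder).
import random

all_genes = [f"GENE{i}" for i in range(20000)]           # placeholder gene universe
gwas_hits = set(random.sample(all_genes, 250))           # placeholder: post-2015 GWAS genes
method_predictions = set(random.sample(all_genes, 300))  # placeholder: method output (pre-2015 literature)

observed = len(method_predictions & gwas_hits)

# Empirical null: how often does a random pick of the same size do at least as well?
n_iterations = 10000
as_good = sum(
    len(set(random.sample(all_genes, len(method_predictions))) & gwas_hits) >= observed
    for _ in range(n_iterations)
)
print(f"observed overlap = {observed}, empirical p = {as_good / n_iterations:.4f}")
```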

Thanks again for your time.
 