MECFS data analysis thread

SNT Gatchaman · Mar 25, 2024

When comparing blood metabolomics papers, note the preference for plasma over serum, as described in "Plasma Instead of Serum Avoids Critical Confounding of Clinical Metabolomics Studies by Platelets" (2024, Journal of Proteome Research). It may be best to discard any serum studies, or at least to sub-group them, as one would for male and female participants.

Midnattsol · Mar 27, 2024

FMMM1 said:
I think metabolomics is basically used when you don't know the target pathway, gene --- hypothesis/concept free - similar to GWAS [DecodeME].

Just like with other omics, it's possible to do targeted metabolomics, looking for specific metabolites. This can be somewhat problematic as some metabolites are then often included in studies because they have previously been found to be significantly different - when this finding may have been spurious. Still the evidence base for the specific metabolites grows, while other metabolites that may have been more important are not studied further (made worse by how some metabolites are easier to isolate and measure than others).

Trish said:
I guess for the women there may also be differences across their menstrual cycle. Also I'm guessing dietary differences may affect metabolites, eg if on keto or vegan diets?

There are at least two small n studies on metabolomic changes during the menstrual cycle, but they have all the same issues as have already been mentioned, although at least with a cycle you get several measurements from the same person.

Dietary differences don't have to be as dramatic as going on a keto vs vegan diet; a simple difference in preferred source of protein or cooking fat could be enough to get changes in amino acids and/or lipids. With lipids there is also dietary history, as lipids consumed previously are released from adipose tissue into circulation to be used for energy.

chillier said:
You might hope that a longitudinal design, where you compare the differences between time points in the same group of people, might reduce the noise at least in part.

I'd love a longitudinal design! Preferably together with some way of indicating if in PEM or not. Or other symptoms. I'm not sure my metabolome would necessarily be different from a healthy person's in periods when I feel fine, but when I feel poorly... and what about those periods when one feels fine except when trying to do something (I at least have them, I can feel perfectly fine when for example sitting still, but trying to move or do something cognitively demanding will make me feel awful).

I'd want a team to be able to visit participants at home to take samples, to reduce differences in the strain of getting to a test site. And participants to be able to say how their level of activity had been in the last couple of days (for example it could be helpful to know if someone had not been able to pace for whatever reason, or felt they were going to get PEM, were in PEM etc). Ideally I'd also want each participant to get a similar type of diet sent to their home to standardise diet, and maybe an activity tracker of some sort... if one only had the resources.
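
The noise-reduction argument for a longitudinal design can be illustrated with a small simulation. This is only a sketch with made-up numbers: the sample size, the person-to-person spread, the measurement noise and the 0.3 log2-unit shift are all assumptions for illustration, not values from any study.

```r
# Simulated log2 metabolite levels for 20 people at two time points.
# All numbers here are invented for illustration.
set.seed(1)
n <- 20
person_baseline <- rnorm(n, mean = 10, sd = 1)        # stable between-person differences
rested <- person_baseline + rnorm(n, sd = 0.3)        # measurement when rested
in_pem <- person_baseline + 0.3 + rnorm(n, sd = 0.3)  # same people, assumed shift of 0.3

# Treating the two time points as unrelated groups keeps the between-person noise
unpaired <- t.test(in_pem, rested)
se_unpaired <- sqrt(var(in_pem) / n + var(rested) / n)

# Comparing each person with themselves removes it
paired <- t.test(in_pem, rested, paired = TRUE)
se_paired <- sd(in_pem - rested) / sqrt(n)

# Standard error of the estimated shift under each design
c(unpaired = se_unpaired, paired = se_paired)
```

The paired standard error is smaller because each person's stable baseline cancels out of the within-person difference. That only helps to the extent that the between-person differences really are stable between visits.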

wigglethemouse said:
I guess a simpler way to explain - garbage in garbage out if you don't implement optimized sample handling and processing.

But what is possible/optimal for one study set up may not be so for another. And now there was the recent event of a biobank freezer that stopped working, destroying so many collected samples that I could cry.

Murph · Mar 28, 2024

Great work @chillier !

I want to come back and respond to that but first I will post the data and code for the analysis I did above.

So here's a link for the Hanson 2020 data (clicking this will download a zip file to your computer) https://www.mdpi.com/2218-1989/10/1/34/s1
and here's a link to the code I used to compare it to Hanson 2022.

https://github.com/jasemurphy/mecfs/blob/main/Hanson_2020_vs_2022.R

Murph · Mar 28, 2024

chillier said:
I've checked here using a linear model whether there is evidence for a given metabolite, that the two studies disagree in their fold change estimate (depends both on the fold change estimate but also the variance of the data).

For the whole lot of common detected metabolites you can see the papers actually agree (or strictly speaking: don't necessarily disagree) on the vast majority of metabolites, it's just that what they're agreeing on is that there's no good evidence that the metabolite is different between ME and HC. Looking only at the significant ones you can see it's roughly half that disagree.

View attachment 21526 View attachment 21527

Anyway none of the metabolites that agree and are significant pass multiple test correction. The ones that pass multiple test correction and don't agree appear to be drug related metabolites. To be fair, Germain/Hanson 2022 already argue there isn't really much of a difference in the metabolomes at baseline, it's only after exercise that it gets interesting.

(The p values here are recalculated from a combined analysis of both datasets with a log linear model and disagreement determined from an interaction term between group and paper)

1. This is so wonderful to see, thanks heaps for your hard work! My dream with this project was that the many impressive people in this community would pitch in, and it is starting already!

2. Is your topline takeaway from this that Hanson 2020 and 2022 have no standout areas of agreement? Is that a fair summary? Of course there are those metabolites that they agree on and which are significant, but just as many that they disagree on which are significant. It would be excellent to see if any *other* studies have looked at those that they do agree on.

3. Would you be willing to share your code? Even just ctrl+c, ctrl+v dumping it into a comment here would be useful to me!

4. Did you use the D1-PRE data from Hanson 2022? And does it change if you look at the other timepoints?

Murph · Mar 28, 2024

This next chart compares Hanson 2020 to Naviaux 2017. There's not loads of metabolites in common but the ones I found in common do not show strong agreement. Note this chart has lipids and other molecules all mixed in.

Here's the code for this one: https://github.com/jasemurphy/mecfs/blob/main/Hanson_2020_vs_Naviaux_2017.R

I imagine some people may view this analysis as basic but there's a method to the approach. Before we go to complex interrogation of the data it's worth asking it some very simple questions. So far it doesn't seem to have many answers. Showing that fact is a bit like reporting your null results.

Murph · Mar 28, 2024

Here's another one while I'm on a roll. Naviaux vs Fluge 2021. This is the Fluge paper where they run some unsupervised machine learning to create subsets. This chart has all the subsets bundled together but the next step might be to see how it looks if you plot each separately.

code for this one: https://github.com/jasemurphy/mecfs/blob/main/Naviaux2017_vs_Fluge2021.R

FMMM1 · Mar 28, 2024

Midnattsol said:
Just like with other omics, it's possible to do targeted metabolomics, looking for specific metabolites. This can be somewhat problematic as some metabolites are then often included in studies because they have previously been found to be significantly different - when this finding may have been spurious. Still the evidence base for the specific metabolites grows, while other metabolites that may have been more important are not studied further (made worse by how some metabolites are easier to isolate and measure than others).

Jonathan Edwards's post here*, re TPPP1 gene, illustrates(?) a potential scenario i.e. a target gene/pathway is discovered and you run a metabolomics study on that group:

untargeted i.e. measure all the metabolites you can (hypothesis free);
targeted at e.g. pathway you consider should be relevant (hypothesis driven).

As per your comment, a hypothesis driven (i.e. targeted) metabolic study might e.g. be able to look at adjusting the procedure to increase sensitivity (detectability) of certain metabolites.

Still haven't seen a comment re limitation of Metabolon data (1000 metabolites) and how that could be addressed e.g. did Hanson do a study looking for a wider range of metabolites - is that the sort of thing we need?

*Jonathan - "Who knows? - we would need to find out. Maybe that the critical abnormal process in ME is so hard to observe because it involves supramolecular solid phase changes within brain cells at a level that at present we have no means to observe directly?"
https://www.s4me.info/threads/genet...022-hajdarevic-et-al.25070/page-3#post-411554

Murph · Mar 29, 2024

There are some great points here about the limits of knowledge. I want to describe how I see it.

We know most studies cover only a fraction of the metabolites in the body. Even when thousands are measured it's possible there's a smoking gun that we haven't measured or that science can't yet measure.

It's possible that the body fluids measured don't contain any evidence of illness at all. Or it's possible the evidence is present only under certain conditions.

Nevertheless it's also possible there is a metabolite or two that will show up as elevated across several studies, hidden only by statistical noise. That's what I'm trying to sort out by including everything, whether deemed statistically significant in an individual study or not. A perfectly useful answer will be: nope, you can't find anything that stands out at a metabolite level. The result of that might be to change the way scientists approach metabolomics in ME/CFS.

After metabolite-level analysis the second-best approach is pathway based analysis.
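
The idea of including every result, significant or not, and looking for a signal across studies can be sketched as a fixed-effect inverse-variance pooling of log2 fold changes. The three studies and their numbers below are hypothetical, the p values use a normal approximation, and the function name `pool_fixed_effect` is made up for this example. A real analysis would probably also need to allow for differences between studies (for example with a random-effects model).

```r
# Hypothetical log2 fold changes (ME/CFS vs control) for one metabolite in
# three studies, with their standard errors. None is significant on its own.
studies <- data.frame(
  study   = c("A", "B", "C"),
  log2_fc = c(0.20, 0.25, 0.15),
  se      = c(0.12, 0.15, 0.10)
)

# Fixed-effect inverse-variance pooling: more precise studies get more weight
pool_fixed_effect <- function(est, se) {
  w <- 1 / se^2
  pooled <- sum(w * est) / sum(w)
  pooled_se <- sqrt(1 / sum(w))
  z <- pooled / pooled_se
  list(estimate = pooled, se = pooled_se, z = z, p = 2 * pnorm(-abs(z)))
}

# Per-study p values (normal approximation) and the pooled result
studies$p <- 2 * pnorm(-abs(studies$log2_fc / studies$se))
pooled <- pool_fixed_effect(studies$log2_fc, studies$se)

studies$p
pooled$p
```

In this toy case each study alone has p above 0.05 while the pooled estimate does not, which is the "hidden only by statistical noise" scenario. If the studies pointed in different directions the pooled estimate would shrink towards zero instead.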

chillier · Mar 30, 2024

Murph said:
1. This is so wonderful to see, thanks heaps for your hard work! My dream with this project was that the many impressive people in this community would pitch in, and it is starting already!

2. Is your topline takeaway from this that Hanson 2020 and 2022 have no standout areas of agreement? Is that a fair summary? Of course there are those metabolites that they agree on and which are significant, but just as many that they disagree on which are significant. It would be excellent to see if any *other* studies have looked at those that they do agree on.

3. Would you be willing to share your code? Even just ctrl+c, ctrl+v dumping it into a comment here would be useful to me!

4. Did you use the D1-PRE data from Hanson 2022? And does it change if you look at the other timepoints?

1) Thank you! I'm glad you think so

2) I think that's basically right, just because the ones they agree on are not significant after multiple test correction. I don't think it's necessarily problematic that they disagree on lots of things though - the ones they disagree on that are most significant are all drug metabolites and it's totally fine that those disagree.

3) Yep, no problem:

R code said:
library(tidyverse)
setwd("")

################### DATA PREPROCESSING 2020 ################

h2020 <- read_csv("hanson2020.csv")
colnames(h2020) <- make.names(colnames(h2020),allow_=F)
h2020 <- h2020[,-c(1,4,6,7,8,9,10,11,12,13,14,67)]

# Function to replace NA with minimum value in each row
replace_na_with_min <- function(row) {
  min_val <- min(row, na.rm = TRUE)
  row[is.na(row)] <- min_val
  return(row)
}

# impute missing values with minimum from each metabolite
h2020[h2020==0] <- NA
h2020 <- bind_cols(h2020[,c(1,2,3)],as_tibble(t(apply(h2020[,-c(1,2,3)], 1, replace_na_with_min))))

# normalise by 10,000 units of intensity seen in that sample
# (sample columns are everything after the three annotation columns)
h2020 <- h2020 %>% mutate(across(-c(1,2,3), ~ (./sum(.))*10000))

################### DATA PREPROCESSING 2022 ################

h2022 <- read_csv("hanson2022_raw2.csv",col_names=F)
colnames(h2022) <- make.names(c(as.character(h2022[8,1:15]),"uniqID",1:404))

lookup22 <- rbind(colnames(h2022),h2022[1:7,])
lookup22 <- lookup22[,-c(1:15)] %>% t %>% as_tibble
colnames(lookup22) <- make.names(lookup22[1,])
lookup22 <- lookup22[-1,]

h2022 <- h2022[-c(1:8),]
h2022 <- h2022[,-c(1,4,6,7,8,9,10,11,12,13,14,15,16,421,422)]

# replace NAs with minimum values, then remove metabolites with no data
h2022 <- h2022 %>% mutate_at(vars(X1:X404), funs(as.numeric(gsub(",", "", ., fixed = TRUE))))
h2022$COMP.ID <- as.numeric(h2022$COMP.ID)
h2022 <- bind_cols(h2022[,c(1,2,3)],as_tibble(t(apply(h2022[,-c(1,2,3)], 1, replace_na_with_min))))
h2022 <- h2022 %>%
  mutate(across(everything(), ~replace(., is.infinite(.), 0)))
# keep metabolites with some data (logical indexing is safe even when no row sums to zero)
h2022 <- h2022[rowSums(h2022[,-c(1,2,3)]) != 0,]

# normalise by 10,000 units of intensity seen in that sample
h2022 <- h2022 %>% mutate_at(vars(X1:X404), funs((./sum(.))*10000))

# filter to only D1-PRE (TO LOOK AT OTHER POINTS JUST CHANGE D1-PRE BELOW TO WHATEVER)
uniqID_to_remove <- filter(lookup22,Timepoint!="D1-PRE")$uniqID
h2022 <- select(h2022, -all_of(uniqID_to_remove))

################## TRIAL RUN JUST PICKING OUT AND COMPARING PROLINE (AND SUITABILITY OF LOG2 TRANSFORM) ################

# find proline for a test run
h2020[grep("proline",h2020$BIOCHEMICAL),]
h2022[grep("proline",h2022$BIOCHEMICAL),]

# proline's comp id is 1898
proline2022 <- h2022[which(h2022[,3]==1898),]
proline2020 <- h2020[which(h2020[,3]==1898),]

# comparing normality assumptions with untransformed and log2 transformed data
proline2020_vec <- as.numeric(proline2020[1,-c(1:3)])
logpro20_vec <- log2(proline2020_vec)
hist(proline2020_vec)
qqnorm(proline2020_vec)
qqline(proline2020_vec)

# log fits better (for proline in 2020 data)
hist(logpro20_vec)
qqnorm(logpro20_vec)
qqline(logpro20_vec)

proline2022_vec <- as.numeric(proline2022[1,-c(1:3)])
logpro22_vec <- log2(proline2022_vec)
hist(proline2022_vec)
qqnorm(proline2022_vec)
qqline(proline2022_vec)

# log fits better (for proline in 2022 data)
hist(logpro22_vec)
qqnorm(logpro22_vec)
qqline(logpro22_vec)

# preprocessing dataframe so group names are consistent
pro20_long <- proline2020 %>% gather("group", "y", -c(BIOCHEMICAL,SUPER.PATHWAY,COMP.ID))
pro20_long <- pro20_long %>%
mutate(group = case_when(
startsWith(group, "C") ~ "Control",
startsWith(group, "P") ~ "CFS",
TRUE ~ group # if neither condition is met, keep the original value
))

pro22_long <- proline2022 %>% gather("uniqID", "y", -c(BIOCHEMICAL,'SUPER.PATHWAY','COMP.ID'))
pro22_long <- left_join(pro22_long,lookup22,by="uniqID") %>% select(BIOCHEMICAL,SUPER.PATHWAY,COMP.ID,group = Phenotype,y)

# combining data
pro22_long$paper <- rep("2022",nrow(pro22_long))
pro20_long$paper <- rep("2020",nrow(pro20_long))
pro_long <- bind_rows(pro20_long,pro22_long)
pro_long$logy <- log2(pro_long$y)

# making linear model with interaction term for proline
pro_m1 <- lm(logy ~ group + paper + group:paper,data=pro_long)
# cannot say that the fold change observed in one paper is different to the fold change observed in the other
summary(pro_m1)

ggplot(pro_long, aes(x=group,y=logy,color=paper)) +
geom_jitter(alpha=0.5) +
geom_violin()

############## GENERALISED FUNCTION FOR COMPARING METABOLITES AND INTERACTIONS WITH PAPER ##############

# function is just a generalised version of what was done for proline
# (log2_transform is currently unused: the log2 transform is always applied)
get_met_model <- function(comp_id, log2_transform=TRUE) {
  met2022 <- h2022[which(h2022[,3]==comp_id),]
  met2020 <- h2020[which(h2020[,3]==comp_id),]

  met20_long <- met2020 %>% gather("group", "y", -c(BIOCHEMICAL,SUPER.PATHWAY,COMP.ID))
  met20_long <- met20_long %>%
    mutate(group = case_when(
      startsWith(group, "C") ~ "Control",
      startsWith(group, "P") ~ "CFS",
      TRUE ~ group # if neither condition is met, keep the original value
    ))
  met22_long <- met2022 %>% gather("uniqID", "y", -c(BIOCHEMICAL,SUPER.PATHWAY,COMP.ID))
  met22_long <- left_join(met22_long,lookup22,by="uniqID") %>%
    select(BIOCHEMICAL,SUPER.PATHWAY,COMP.ID,group = Phenotype,y)
  met22_long$paper <- rep("2022",nrow(met22_long))
  met20_long$paper <- rep("2020",nrow(met20_long))
  #print(met20_long)
  #print(met22_long)
  met_long <- bind_rows(met20_long,met22_long)
  met_long$logy <- log2(met_long$y)
  met_m1 <- lm(logy ~ group + paper + group:paper,data=met_long)
  #summary(met_m1)
  # pulling out relevant coefficients and p values from the model and returning them
  # (with two groups and two papers, row 2 is the group term and row 4 the group:paper interaction)
  FC_2020 <- coef(summary(met_m1))[2,1]
  FC_2022 <- coef(summary(met_m1))[2,1]+coef(summary(met_m1))[4,1]
  FCdiff_p <- coef(summary(met_m1))[4,4]
  FC_p <- coef(summary(met_m1))[2,4]
  terms <- c(comp_id, FC_2020, FC_2022, FCdiff_p, FC_p)
  terms
}

# sanity check: test output matches proline summary from model above
m1 <- get_met_model(comp_id=1898)
summary(pro_m1)
m1
# it does!

#################### CALCULATE P VALS, INTERACTION P VALS ETC FOR EVERY METABOLITE ###################

# go through every common metabolite between the two papers and calculate parameters
common_compids <- intersect(h2022$COMP.ID,h2020$COMP.ID)
metabolite_interactions <- matrix(NA, length(common_compids),5)
for (i in seq_along(common_compids)) {
  # fit the model for the i-th common metabolite only
  terms <- get_met_model(comp_id=common_compids[i])
  metabolite_interactions[i,1] <- terms[1]
  metabolite_interactions[i,2] <- terms[2]
  metabolite_interactions[i,3] <- terms[3]
  metabolite_interactions[i,4] <- terms[4]
  metabolite_interactions[i,5] <- terms[5]
}

# make tibble, change names, put log2(FC) estimates to 2^ to get FC estimates,
metabolite_interactions <- metabolite_interactions %>% as_tibble()
names(metabolite_interactions) <- c("COMP.ID","FC_2020","FC_2022","FCdiff_p","FC_p")
metabolite_interactions <- metabolite_interactions %>% mutate_at(vars(FC_2020:FC_2022), funs(2^.))

# df of all metabolites in both papers and their names etc
h_all <- bind_rows(h2022[,c(1:3)],h2020[,c(1:3)]) %>% distinct()

# add names to metabolite_interactions df
h_met_interactions <- left_join(metabolite_interactions, h_all, by="COMP.ID")

# add Agree column for if interaction p val < 0.05
h_met_interactions <- h_met_interactions %>%
mutate(Agree = ifelse(FCdiff_p < 0.05, 'N', 'Y')) %>%
mutate(significant = ifelse(FC_p < 0.05, 'Y', 'N'))

# multiple test correct p vals
h_met_interactions$FC_q <- p.adjust(h_met_interactions$FC_p, method="BH")

############### PLOTTING ##################

plot1 <- ggplot(h_met_interactions, aes(x=FC_2020, y=FC_2022,colour=Agree,shape=significant)) +
geom_point(size=3) +
geom_hline(yintercept=1) +
geom_vline(xintercept=1) +
theme_minimal() +
labs(x="Germain & Hanson et al 2020", y="Germain & Hanson et al 2022") +
scale_colour_manual(values = c("N" = "#7BAFD4", "Y" = "#FB9A99")) +
ggtitle("All metabolites found in both papers")
#ggsave("allmetabs_bothpapers.png",plot1,bg="white")

plot2 <- ggplot(h_met_interactions[which(h_met_interactions$significant=="Y"),], aes(x=FC_2020, y=FC_2022,colour=Agree)) +
geom_point(size=3,shape=17) +
geom_hline(yintercept=1) +
geom_vline(xintercept=1) +
theme_minimal() +
scale_colour_manual(values = c("N" = "#7BAFD4", "Y" = "#FB9A99")) +
labs(x="Germain & Hanson et al 2020", y="Germain & Hanson et al 2022") +
ggtitle("Metabolites found significant in this analysis")
#ggsave("significantmetabs_bothpapers.png",plot2,bg="white")

# plot 1 is all metabolites
plot1
# plot 2 is only significant metabolites
plot2

# to peruse
h_met_interactions %>% filter(significant == "Y" & Agree=="Y") %>% fix
h_met_interactions %>% arrange(FC_q) %>% fix

hanson2020.csv is the link you gave https://www.mdpi.com/2218-1989/10/1/34/s1 supplementary 1, saved as a csv. hanson2022_raw2.csv is https://insight.jci.org/articles/view/157621/sd/2 with the 'original' sheet saved as a csv. Then just set the working directory at the top and it should all run without issue hopefully. The plots it generates look slightly different to the ones I posted before, because previously I used the scaled data whereas now I am normalising myself with the raw data - I'm normalising slightly differently (by sample rather than by metabolite).

4) Yes I'm using D1-PRE, you can change to other time points from one line in the code (in the data preprocessing section).
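
The difference between normalising by sample and by metabolite, mentioned under 3) above, can be shown on a toy matrix. This is a sketch only: the intensities are invented, and "by metabolite" is assumed here to mean dividing each metabolite by its own median across samples, which may not match the scaling actually used in the papers' scaled data.

```r
# Toy intensity matrix: rows are metabolites, columns are samples.
# Values are invented for illustration.
raw <- matrix(
  c(100, 200, 400,
     10,  10,  40,
     90, 190, 360),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("met1", "met2", "met3"), c("s1", "s2", "s3"))
)

# By sample: each column is rescaled so its intensities sum to 10,000
by_sample <- sweep(raw, 2, colSums(raw), "/") * 10000

# By metabolite (assumed definition): each row is divided by its own median
by_metabolite <- sweep(raw, 1, apply(raw, 1, median), "/")

colSums(by_sample)               # every sample now has the same total
apply(by_metabolite, 1, median)  # every metabolite now has median 1
```

By-sample scaling removes differences in total signal between samples, while by-metabolite scaling puts abundant and rare metabolites on a common scale. The two answer different questions, so some shift in the plots is expected when switching between them.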

Murph · Apr 5, 2024

I just want to point out that these kinds of -omics studies are still coming! Bergquist and Armstrong are going to drop a really detailed study soon, with LOADS of datapoints taken at 20 minute intervals and a couple of thousand metabolites and lipids. I'm keen to wrestle the existing data into a format where we can see more easily whether the upcoming findings confirm prior signals, upset the apple cart or simply reaffirm that the whole field is too noisy.

Some of the findings in the video look quite good I will say.

wigglethemouse · Apr 5, 2024

Murph said:
I'm keen to wrestle the existing data into a format where we can see more easily whether the upcoming findings confirm prior signals, upset the apple cart or simply reaffirm that the whole field is too noisy.

It seemed from the Bergquist presentation where those slides come from (Lisbon Conference Apr 4th 2024) that they really wanted to dive into the lipids to get a better understanding of what exactly is happening. Low ceramides showed up yet again! I think he said that he was putting together an even more detailed metabolomic analysis that will have even more lipid identification.

What was interesting is that there are two arms to the study. Bergquist used exercise, social activity, and a mental task to track how metabolites change over 8 hours, and Armstrong will have a set of 3 controlled meals to see how diet affects the metabolites. These are in-home studies to remove the stress of traveling and being at a research lab.

wigglethemouse · Apr 5, 2024

I just came across the data for Lipkin & Fiehn 2022 metabolite paper in case you haven't seen it @Murph
https://pdfs.semanticscholar.org/d2c1/82b950b2529e8ea1a678813bb142abdf2b86.pdf

forestglip · Dec 29, 2024

Cool idea @Murph!

I just yesterday started something in the same vein. Complement keeps popping up, so I thought it could be useful to look at every study that has tested any complement proteins, compile them into one dataset, and visualize which ones are consistently higher or lower across studies. Here's some of the studies I've looked at so far:

Then I'll make each protein have its own column and each study have its own row. The cell where they intersect will be dark red if it was increased, dark blue if it was decreased, light red if it increased but wasn't significant, same with light blue, and grey if not significant and no details on which direction. That way you could look down a column and see if it's consistently blue or red.

It'd probably be more interesting to compare actual values, but that's probably a lot more work, so I'll start here.

And maybe after this I'll try to do the same with all proteins from proteomics studies.

Though some people here have made good points. One thing that is particularly worrying for a hope of finding anything with this approach is whether the thing "to be found" is affected by exertion, since ME/CFS is so closely related to exertion. If it's higher than controls when rested and lower than controls after exertion, it'll be completely missed here if different study populations had recently done different levels of exertion.

But I think there is still a chance there's some consistent factor that is always higher or always lower in ME/CFS.

forestglip · Dec 29, 2024

So I think it makes more sense to just go all in on all proteins, instead of doing just complement and then coming back to everything else later. Otherwise I'd have to go over pretty much every paper twice.

So here's what I've got of my half-thought out plan so far.

I got all the findings I could from this study so far: Preprint Role of the complement system in Long COVID, 2024, Farztdinov, Scheibenbogen et al.

The supplementary files don't have the data I need, so I had to just read it off Figure 1. I just copied the volcano plots for the four cohorts they studied into an image editor and marked the proteins as I logged the data.

If they were significantly decreased they got -2, non-significantly decreased -1, non-significantly increased 1, significantly increased 2. 0 if not significant and they don't say which direction. (And 3 if not significant and I can't tell which direction: in Figure 1D, I can't tell whether some of the circles are red or blue.)

And so this is kind of what I hope to create with many studies. This is all from the above study, but they looked at four cohorts. (The second is just a subset of the first.)
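
The coding scheme and the study-by-protein layout described above could be sketched in R roughly like this. Everything in the example is hypothetical: the cohort and protein names, the findings, and the helper name `encode_finding`. The special code 3 (direction unreadable from a figure) is left out and would need its own handling, for example as `NA`.

```r
# Map a reported direction and significance to the -2..2 code.
# direction: "up", "down" or NA (not reported); significant: TRUE or FALSE.
encode_finding <- function(direction, significant) {
  sgn <- ifelse(is.na(direction), 0, ifelse(direction == "up", 1, -1))
  sgn * ifelse(significant, 2, 1)
}

# Invented findings in long format: one row per cohort and protein
findings <- data.frame(
  cohort      = c("cohort1", "cohort1", "cohort2", "cohort2", "cohort3", "cohort3"),
  protein     = c("protA", "protB", "protA", "protB", "protA", "protB"),
  direction   = c("up", "down", "up", NA, "up", "up"),
  significant = c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)
findings$code <- encode_finding(findings$direction, findings$significant)

# Wide layout: one row per cohort, one column per protein
scores <- tapply(findings$code, list(findings$cohort, findings$protein),
                 function(x) x[1])

# Direction consistency per protein: 1 = always up, -1 = always down
consistency <- colMeans(sign(scores))

scores
consistency
```

Looking down a column of `scores` corresponds to scanning a column of the colour-coded table for a run of red or blue. Keeping the data in long format first makes it easier to add fold changes and p values later for the studies that report them.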

I'm sure it'll be quite annoying trying to deal with files of ~1000 proteins for some studies. Will try to automate where I can.

mariovitali · Dec 30, 2024

@forestglip Would it be possible to share this work via Google sheets or something similar? Let me know if I can help on extracting this information

forestglip · Dec 30, 2024

mariovitali said:
@forestglip Would it be possible to share this work via Google sheets or something similar? Let me know if I can help on extracting this information

Sure yeah that'd be awesome. I can DM later to share access.

I'm still not completely sure what I want to do. It'd be nice to create a huge dataset with fold change and significance values for every protein from lots of studies, to do more in depth analysis like what Murph and chillier were posting. But lots of studies don't have that data, they just discuss a few proteins in the text and don't give actual numbers.

So maybe for every study, trying to write down everything that's significant as a binary like above (significant or not, up or down), for the more simple analysis. And if a study has it, adding FC and p values to maybe do something else with eventually.

mariovitali · Dec 30, 2024

@forestglip Thank you! What I am looking for is to end up with the data you described and feed them to my software framework. What I am hoping to do is to generate a pathway analysis that will help us identify the bigger picture. An example is this thread, which was made possible by analysing information using various conceptual levels:

https://twitter.com/user/status/1863564685510861030

forestglip · Dec 30, 2024

mariovitali said:
@forestglip Thank you ! What I am looking for is to end up with the data you described and feed them to my software framework. What I am hoping to do is to generate a pathway analysis that will help us identify the bigger picture. An example is this thread which was made possible by analysing information using various conceptual levels :

https://twitter.com/user/status/1863564685510861030

Sure, I mainly just want to make a large dataset with as many studies as possible so it can be shared and analyzed by anyone in whatever way they want, including the method you said.

forestglip · Feb 3, 2025

Just posting in case anyone was waiting for me to do something with what I described. I tried, I crashed, I failed. But just in case it's a useful idea, I'll describe what I wanted to do.

I was trying to make a website where you enter the URL of a paper and it makes an entry for that paper. Then for that study, you would add "comparisons" which are pairs of two cohorts (e.g. ME/CFS as the "active cohort" and healthy as the "control cohort"). For each cohort comparison, you then enter in all markers that were tested between those groups and the findings.

You would enter a marker, for example a protein like TNF, and it would search markers that have already been entered into the website, as well as existing databases like UniProt and Human Metabolome Database, and show top search results. You'd select the matching protein, for example P01375 · TNFA_HUMAN from UniProt, and it would store that identifier in the website for this marker. This way you could make sure TNF is identified the same way in every study, and not "TNF" in one study and "TNF-alpha" in another. Also, these databases have lots of additional information and keywords that could be added to the website database for each marker, so that it's easier to search study markers, for example searching by TNF's alternate name "Cachectin" or using keywords from the protein's function description.
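
The identifier matching step could be sketched as a simple alias table. This is an assumption-heavy toy: the function names `normalise_name` and `resolve_marker` are made up, the table holds only the TNF example from above, and a real version would pull aliases from UniProt or the Human Metabolome Database rather than a hand-typed data frame.

```r
# Alias table mapping names a paper might use to one canonical identifier.
# Only the TNF example is included; the table structure is hypothetical.
aliases <- data.frame(
  alias      = c("TNF", "TNF-alpha", "Cachectin"),
  uniprot_id = "P01375",
  stringsAsFactors = FALSE
)

# Normalise a reported name so that case, spaces and hyphens do not matter
normalise_name <- function(x) gsub("[^a-z0-9]", "", tolower(x))

# Look up the canonical identifier for each reported name
resolve_marker <- function(reported, alias_table = aliases) {
  hit <- match(normalise_name(reported), normalise_name(alias_table$alias))
  alias_table$uniprot_id[hit]  # NA when the name is not in the table
}

resolve_marker(c("TNF", "tnf alpha", "cachectin", "IL-6"))
```

Names that fail to resolve come back as `NA`, which is the point where a person would be asked to pick the right entry from the search results.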

After selecting the correct marker, you'd enter in whatever the study reported about it. Fold change, whether significant or not, p value, effect size, direction, etc. Probably also some standardized format for the location in the body where the marker was tested (CSF, plasma, etc.)

If the study gives spreadsheets with lots of markers, maybe there could be a more automated way of entering data.

I was thinking something like wiki format, where anyone can add to it. Although there's the issue of verifying whether people are entering real data or nonsense.

I also thought that it would be valuable if a large institution did some very centralized version of this, like a PubMed for study findings, where study investigators themselves would submit the findings to the database. And maybe a dedicated team that also adds in markers for papers that haven't been submitted by investigators.

It seems like this could be useful to detect patterns that may be nearly impossible to detect currently. With probably at least tens of thousands of different markers written in unstructured format in papers, and hundreds of thousands or millions of papers, there are probably lots of interesting patterns hidden in existing papers that are very hard to find. You could automate meta-analyses that pool data and run every time a new paper is added, and send out an alert if something interesting has reached significance.
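
As a very rough sketch of the "re-run on every new paper and alert" idea, the example below keeps a record of reported directions per marker and applies an exact sign test each time a finding is added. The marker names, the data and the function names are invented. A sign test ignores effect sizes and sample sizes, and testing many markers this way would need multiple test correction, so this only illustrates the mechanism.

```r
# Running record of reported directions (+1 up, -1 down) per marker.
# The entries are invented for illustration.
directions <- list(markerA = c(1, 1, 1, 1, 1), markerB = c(1, -1, 1))

# Two-sided exact sign test: is the split between up and down more lopsided
# than a fair coin would give?
sign_test_p <- function(d) {
  d <- d[d != 0]
  binom.test(sum(d > 0), length(d), p = 0.5)$p.value
}

# Add one new finding, re-test that marker, and report whether it crosses alpha
add_finding <- function(record, marker, direction, alpha = 0.05) {
  record[[marker]] <- c(record[[marker]], direction)
  p <- sign_test_p(record[[marker]])
  list(record = record, p = p, alert = p < alpha)
}

before <- sign_test_p(directions$markerA)  # five of five studies report "up"
res <- add_finding(directions, "markerA", 1)

before
res$p
res$alert
```

With five concordant studies the two-sided p value is 0.0625, and a sixth concordant study brings it to 0.03125, so the alert would fire only after the new paper is added.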

As an example to go back to what I was writing about before, it'd be interesting to look at every single complement protein that has been tested in every long COVID paper and see if the proteins that are most consistently low or high provide any useful information. And also check if the same proteins match with ME/CFS studies. Or MS or depression.

Yann04 · Feb 3, 2025

forestglip said:
Just posting in case anyone was waiting for me to do something with what I described. I tried, I crashed, I failed. But just in case it's a useful idea, I'll describe what I wanted to do.

I was trying to make a website where you enter the URL of a paper and it makes an entry for that paper. Then for that study, you would add "comparisons" which are pairs of two cohorts (e.g. ME/CFS as the "active cohort" and healthy as the "control cohort"). For each cohort comparison, you then enter in all markers that were tested between those groups and the findings.

You would enter a marker, for example a protein like TNF, and it would search markers that have already been entered into the website, as well as existing databases like UniProt and Human Metabolome Database, and show top search results. You'd select the matching protein, for example P01375 · TNFA_HUMAN from UniProt, and it would store that identifier in the website for this marker. This way you could make sure TNF is identified the same way in every study, and not "TNF" in one study and "TNF-alpha" in another. Also, these databases have lots of additional information and keywords that could be added to the website database for each marker, so that it's easier to search study markers, for example searching by TNF's alternate name "Cachectin" or using keywords from the protein's function description.

After selecting the correct marker, you'd enter in whatever the study reported about it. Fold change, whether significant or not, p value, effect size, direction, etc. Probably also some standardized format for the location in the body where the marker was tested (CSF, plasma, etc.)

If the study gives spreadsheets with lots of markers, maybe there could be a more automated way of entering data.

I was thinking something like wiki format, where anyone can add to it. Although there's the issue of verifying whether people are entering real data or nonsense.

I also thought that it would be valuable if a large institution did some very centralized version of this, like a PubMed for study findings, where study investigators themselves would submit the findings to the database. And maybe a dedicated team that also adds in markers for papers that haven't been submitted by investigators.

It seems like this could be useful to detect patterns that may be nearly impossible to detect currently. With probably at least tens of thousands of different markers written in unstructured format in papers, and hundreds of thousands or millions of papers, there are probably lots of interesting patterns hidden in existing papers that are very hard to find. You could automate meta-analyses that pool data and run every time a new paper is added, and send out an alert if something interesting has reached significance.

As an example to go back to what I was writing about before, it'd be interesting to look at every single complement protein that has been tested in every long COVID paper and see if the proteins that are most consistently low or high provide any useful information. And also check if the same proteins match with ME/CFS studies. Or MS or depression.

That seems fascinating, but also a very hard task to program from scratch.

Cool Idea though!
