Neurodevelopment Genes Encoding Olduvai Domains Link Myalgic Encephalomyelitis to Neuropsychiatric Disorders, 2025, Lidbury et al

They have previously described this cohort in another paper:
Other clinical procedures, tests, cohort descriptions, and results have been reported previously, including comparisons with a healthy control group [29] and the calculation of symptom severity via Weighted Standing Time (WST) [30]. For these exome analyses, only the ME/CFS cohort was investigated, with 77 of the 80 participants initially recruited providing consent for this study, as well as meeting all inclusion criteria and availability requirements.
[29] Lidbury, B.A.; Kita, B.; Richardson, A.M.; Lewis, D.P.; Privitera, E.; Hayward, S.; de Kretser, D.; Hedger, M. Rethinking ME/CFS Diagnostic Reference Intervals via Machine Learning, and the Utility of Activin B for Defining Symptom Severity. Diagnostics 2019, 9, 79.

[30] Richardson, A.M.; Lewis, D.P.; Kita, B.; Ludlow, H.; Groome, N.P.; Hedger, M.P.; de Kretser, D.M.; Lidbury, B.A. Weighting of orthostatic intolerance time measurements with standing difficulty score stratifies ME/CFS symptom severity and analyte detection. J. Transl. Med. 2018, 16, 97.

Ref 29 has 80 participants so that matches with this study, but ref 30 has 45 participants, so I think that's a different cohort.

Edit: It says again here that ref 30 is the same cohort, so maybe it was a subset of this current study's cohort:
This ME/CFS cohort has been described previously and compared to a non-ME/CFS (healthy) control group in relation to pathology markers and serum activin B, with symptom severity assessed via Weighted Standing Time (WST) [29,30].
 
I wish they had included a count of n/77 for cases carrying each variant so we could get a better feel for the data. I also couldn't find where they pulled the comparison MAF numbers from (sorry, I can only scan and Ctrl-F search papers nowadays).
Supplementary table S1 seems to have some sort of extra data but I'm not sure what it is listing.
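For what it's worth, if those per-variant counts were available, the comparison the paper implies could be reproduced with a simple 2x2 test against a reference MAF. A minimal sketch, assuming made-up counts (none of the numbers below come from the paper):

```python
# Minimal sketch: compare case allele counts against a reference MAF with
# Fisher's exact test. All counts and frequencies are hypothetical, not from
# the paper.
from scipy.stats import fisher_exact

n_cases = 77                          # ME/CFS participants with exome data
case_alt = 30                         # hypothetical alt-allele count in cases
case_total = 2 * n_cases              # diploid autosomal site -> 154 alleles

ref_maf = 0.10                        # hypothetical reference MAF
ref_total = 2 * 323                   # the 1000 Genomes subset used as controls
ref_alt = round(ref_maf * ref_total)

table = [[case_alt, case_total - case_alt],
         [ref_alt, ref_total - ref_alt]]
odds_ratio, p_value = fisher_exact(table)
print(f"OR = {odds_ratio:.2f}, p = {p_value:.2e}")
```

With counts like these in hand for each variant, it would also be easy to see how sensitive the reported p values are to the particular MAF chosen for comparison.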

When analysing GWAS data you need to be aware of miscalls for the platform you are using. I'm aware of studies that found a very high frequency in cases against a low reference MAF (or vice versa), and it turned out to be down to known mis-calls of the technology used, or the reference MAF being badly wrong. The most famous example in ME/CFS is the Perez et al 23andMe ME/CFS paper, which didn't clean the dataset beforehand. That is why I am dubious of such low p values.
RE: Re-analysis of Genetic Risks for Chronic Fatigue Syndrome From 23andMe Data Finds Few Remain

I have much more faith in DecodeME because patients and controls use the same sequencing technology and process.
 
GWAS Analysis: As mentioned in the introduction, given the low prevalence of ME we used 323 individuals belonging to Caucasian communities recruited and genotyped by the 1000 Genome Project (1 K Genomes) as controls [40]: Utah residents (CEPH) with Northern and Western European Ancestry (CEU, n = 32), Finnish individuals in Finland (FIN, n = 93), British individuals in England and Scotland (GBR, n = 86), an Iberian population in Spain (IBS, n = 14), and individuals from Tuscany in Italy (TSI, n = 98). The potential bias was minimal given the rareness of the ME phenotype.
Considering their control group is just individuals from the general population, and they didn't attempt to limit it to those without ME/CFS, why would they restrict it to 323 individuals? Couldn't they have used a few hundred thousand individuals from a database like the UK Biobank to increase statistical power?
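To put rough numbers on that question, here is a small simulation sketch of how control-group size affects power for a single variant. The case count is fixed at 77; the MAF, odds ratio and significance threshold are assumptions chosen purely for illustration, not values from the paper:

```python
# Rough power simulation (illustrative only): how often does Fisher's exact
# test on a 2x2 allele-count table reach a chosen threshold as the control
# group grows? All effect sizes and thresholds are assumed for illustration.
import numpy as np
from scipy.stats import fisher_exact

rng = np.random.default_rng(0)

def power(n_cases, n_controls, control_maf, odds_ratio, alpha, n_sims=500):
    odds = control_maf / (1 - control_maf) * odds_ratio
    case_maf = odds / (1 + odds)                       # implied case frequency
    hits = 0
    for _ in range(n_sims):
        a = rng.binomial(2 * n_cases, case_maf)        # case alt alleles
        b = rng.binomial(2 * n_controls, control_maf)  # control alt alleles
        table = [[a, 2 * n_cases - a], [b, 2 * n_controls - b]]
        if fisher_exact(table)[1] < alpha:
            hits += 1
    return hits / n_sims

for n_controls in (323, 5_000, 100_000):
    print(n_controls, power(77, n_controls, control_maf=0.10,
                            odds_ratio=2.0, alpha=1e-5))
```

In sketches like this, power tends to plateau once controls outnumber cases by a few-fold, so with only 77 cases the bottleneck is probably the case count and ancestry matching rather than the absolute number of controls.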
 
This is also the same cohort as their previous paper that found Complex V insufficiency. (That paper says 51 participants though, so not exactly the same.)
The significance of ALDH18A1, in addition to its role in metabolic perturbations, draws attention to mitochondrial function in ME/CFS patients. While future functional studies on the myriad neurodevelopment genes are needed to confirm their significance, mitochondrial function studies have been conducted on the same patient cohort as presented here [84]. Notable for ME/CFS dysfunction were Complex V insufficiency combined with TORC-1 increases in comparison to healthy (non-ME) control participants. Whether ALDH18A1 is directly involved in this ME/CFS mitochondrial function profile will require further investigation.

[84] Missailidis, D.; Annesley, S.J.; Allan, C.Y.; Sanislav, O.; Lidbury, B.A.; Lewis, D.P.; Fisher, P.R. An Isolated Complex V Inefficiency and Dysregulated Mitochondrial Function in Immortalized Lymphocytes from ME/CFS Patients. Int. J. Mol. Sci. 2020, 21, 1074.
 
Just looking at the first one, NBPF1 also has high expression outside the nervous system, including in bone marrow, lymph nodes and skeletal muscle.
https://www.genecards.org/cgi-bin/carddisp.pl?gene=NBPF1#expression
It always baffles me that studies will talk only about an association with the brain when that gene is expressed (even at higher levels) in so many other tissues. I think it's because neuroscience has simply done quite a thorough job of functionally categorizing relevant genes in their neurological context, and that effort has not been matched in other fields. This can often give a skewed impression that the brain must be involved when really it could be a whole host of tissues. There doesn't seem to be any functional data in this paper to confirm that it is the neurological actions of those genes that are relevant.

I'll also echo other concerns about the appropriateness of using a small-cohort genomic analysis method here.
 
I take a slightly different view. Some of us on this thread (at least three) co-authored a paper nearly ten years ago (OK it may all have been my fault) suggesting that there was some real biology deep behind MECFS and that it was likely either in CNS or immune cells or both. So genes expressed in CNS fit a story, even if they are expressed elsewhere.
 
When analysing GWAS data you need to be aware of miscalls for the platform you are using. I'm aware of studies that found a very high frequency in cases against a low reference MAF (or vice versa), and it turned out to be down to known mis-calls of the technology used, or the reference MAF being badly wrong. The most famous example in ME/CFS is the Perez et al 23andMe ME/CFS paper, which didn't clean the dataset beforehand. That is why I am dubious of such low p values.
RE: Re-analysis of Genetic Risks for Chronic Fatigue Syndrome From 23andMe Data Finds Few Remain
Thanks for linking that interesting paper @wigglethemouse. I missed it when it came out - here's the forum thread link.
 
I take a slightly different view. Some of us on this thread (at least three) co-authored a paper nearly ten years ago (OK it may all have been my fault) suggesting that there was some real biology deep behind MECFS and that it was likely either in CNS or immune cells or both. So genes expressed in CNS fit a story, even if they are expressed elsewhere.
The CNS may well be involved somehow in ME/CFS, I’m commenting on a tendency in papers across fields to implicate neurological involvement solely on the basis of genes that happen to be extensively characterized in the CNS but are also expressed elsewhere.

It’s one of my biggest pet peeves with CX3CR1 or GFAP CRE mouse models—people show a significant finding and claim it must be due to microglia or astrocytes when there are a ton of other places where those genes are expressed. Happens in genomics studies all the time as well.
 
This paper looks to have serious quality issues. I took the first three variants in Table 1 (Top ranked genes) and all three have known quality issues noted on the Broad Institute gnomAD browser/database.

NBPF1, rs3897177, SNV:1-16909052-C-T (GRCh37) Warning: Genomes failed random forest filter
Link

NBPF10, rs10910794, SNV:1-145303971-A-G (GRCh37) Warning: Genomes have inbreed coeff. < -0.3
Link

NBPF10, rs1553120233, SNV:1-145355624-C-T (GRCh37) Warning: No matching variants found. rs1553120233
Link
Search based on location 1-145355624-C-T = rs112674709
Warning: This variant is covered in fewer than 50% of individuals in gnomAD v2.1.1 genomes. This may indicate a low-quality site.
Link

I'm not going to check any more.
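For anyone who wants to repeat this kind of check in bulk rather than clicking through the browser, gnomAD also exposes a public GraphQL endpoint. The sketch below is only a starting point: the dataset name and field names are my assumptions based on the gnomAD v2.1.1 schema and should be verified against the current API documentation.

```python
# Sketch: query the gnomAD GraphQL endpoint for a variant's filter flags.
# Dataset/field names (gnomad_r2_1, exome/genome filters, rsid) are assumptions
# based on the v2.1.1 schema and may differ in current releases.
import requests

GNOMAD_API = "https://gnomad.broadinstitute.org/api"

QUERY = """
query VariantFilters($variantId: String!) {
  variant(variantId: $variantId, dataset: gnomad_r2_1) {
    rsid
    exome { ac an filters }
    genome { ac an filters }
  }
}
"""

resp = requests.post(
    GNOMAD_API,
    json={"query": QUERY, "variables": {"variantId": "1-16909052-C-T"}},
    timeout=30,
)
resp.raise_for_status()
# A non-empty "filters" list is the same kind of quality warning the browser
# shows for these sites.
print(resp.json())
```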
 
If you have time @wigglethemouse, can you explain a bit about what those warnings mean?
I don't understand the first two warnings, but I assume they mean some sort of quality issue flagged when Broad runs their automated checks. My take is that there is dodgy data reported in some WGS databases for this variant, so be careful using this data.

EDIT: see the following two posts for Google AI explanations.

The third one is easier: the SNP doesn't exist. Searching by location shows a site with low WGS coverage, which often means companies have discarded the data due to low-quality sequencing, which happens in some regions of the genome.
 
Genomes failed random forest filter
Here is the Google AI response:

If the Broad Institute indicates that a genome (or variant) "failed random forest filter", it means the quality control process flagged it based on a random forest model.
Here's a breakdown:

Random Forest Model in Genome QC: In the context of genomics, especially at the Broad Institute's gnomAD project, a random forest is used to distinguish between high-quality (likely true) and low-quality (likely false positive) genetic variants. It works by using a combination of quality metrics from the sequencing data as input features, like those mentioned in the gnomAD news release. The model is trained using known high-quality and low-quality variants, often identified using hard filters.
Reasons for Failure: A genome or variant can fail the random forest filter due to several factors related to the sequencing data quality and variant calling:
Low mapping quality: Reads might not be mapped correctly to the reference genome, especially in regions with repetitive sequences.
Deviations from Hardy-Weinberg equilibrium: Variants that show significant deviations from expected allele frequencies in a population might be due to errors.
Outliers in quality assessments: Variants that are outliers in quality metrics used by the random forest model, such as sequencing depth, genotype quality, or allele balance, are more likely to be flagged as low-quality.
Mismapped reads: Reads from homologous regions or pseudogenes might be mismapped, leading to incorrect variant calls.
Poor quality sequencing data: Overall poor data quality from the sequencing process can result in low-quality variant calls.
Importance of Filtering: Filtering low-quality variants is crucial to ensure the accuracy of genetic studies and prevent spurious findings. The random forest filter provides a more sophisticated approach compared to traditional hard filters, which can be too stringent and filter out true variants.

In essence, a genome failing the random forest filter suggests that the variants identified within that genome have characteristics that the model associated with low quality and a higher likelihood of being false positives.
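To make the idea concrete, here is a toy sketch of how such a filter works in principle: train a classifier on per-site quality metrics labelled good/bad, then score new sites. The features, labels and thresholds are invented for illustration; this is not gnomAD's actual model, feature set, or training data.

```python
# Toy illustration of a random-forest variant QC filter. Metrics and labels
# are randomly generated stand-ins, not gnomAD's real features or truth sets.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
n = 1000

# Invented per-variant QC metrics: mapping quality, allele balance, depth,
# and an inbreeding-coefficient-like statistic.
X = np.column_stack([
    rng.normal(60, 5, n),       # mapping quality (MQ)
    rng.uniform(0, 1, n),       # allele balance
    rng.poisson(30, n),         # read depth
    rng.normal(0, 0.1, n),      # inbreeding coefficient
])
# Invented "truth" labels: flag sites with low MQ or extreme allele balance.
y = ((X[:, 0] < 55) | (X[:, 1] < 0.2) | (X[:, 1] > 0.8)).astype(int)

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

suspicious_site = [[50.0, 0.15, 28, -0.35]]   # poor MQ, skewed allele balance
print("predicted probability of failing QC:",
      model.predict_proba(suspicious_site)[0, 1])
```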
 
Warning: Genomes have inbreed coeff. < -0.3
Google AI interpretation:

If the Broad Institute's GATK (Genome Analysis Toolkit) provides a warning about a genome's inbreeding coefficient, it likely indicates a higher than expected level of homozygosity at certain genetic locations (sites).
Meaning of Inbreeding Coefficient in this Context:

Excess Heterozygosity: While "inbreeding coefficient" generally suggests inbreeding (mating of related individuals), GATK's calculation in this context focuses on the excess heterozygosity at a variant site.
Deviation from Hardy-Weinberg Equilibrium: A value of 0 indicates the site is in Hardy-Weinberg Equilibrium, a theoretical state where allele and genotype frequencies remain constant across generations in a population.
Negative Values and Potential Issues: Negative values of the Inbreeding Coefficient could suggest an excess of heterozygotes and potentially indicate poor mapping of DNA sequences in that genomic region. Incorrect mapping can lead to artificial variants being called as heterozygous when they are not, or vice versa.
Positive Values and Potential Issues: Positive values suggest a deficiency of heterozygotes, which could be due to inbreeding or other factors like population admixture.

Why the Warning?

Quality Control: The Inbreeding Coefficient is used as a quality control metric to identify problematic variant calls. High (or significantly negative) values can alert researchers to potential issues with the data, including problems with the sequencing or analysis.
Filtering Variants: The Broad Institute recommends filtering out variants with negative Inbreeding Coefficients, as these can be a sign of unreliable data. While positive values might indicate inbreeding, filtering these out is not recommended unless other factors suggest data quality issues.

In summary, a warning about a genome's inbreeding coefficient from the Broad Institute's GATK often highlights potential data quality issues at specific genomic locations rather than definitively diagnosing inbreeding.
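As a concrete illustration of the statistic being described, the site-level inbreeding coefficient is usually computed as F = 1 - (observed heterozygotes / heterozygotes expected under Hardy-Weinberg). A small sketch with made-up genotype counts:

```python
# Sketch of the site-level inbreeding coefficient:
# F = 1 - observed_hets / expected_hets under Hardy-Weinberg equilibrium.
# Genotype counts are hypothetical; excess heterozygotes drive F negative.
def inbreeding_coefficient(n_hom_ref, n_het, n_hom_alt):
    n = n_hom_ref + n_het + n_hom_alt
    p = (2 * n_hom_ref + n_het) / (2 * n)   # reference allele frequency
    expected_het = 2 * p * (1 - p) * n
    return 1 - n_het / expected_het

print(inbreeding_coefficient(25, 50, 25))   # ~0.0: roughly in HWE
print(inbreeding_coefficient(10, 80, 10))   # -0.6: far too many heterozygotes
```

A value below -0.3, as in the warning above, would therefore point to an excess of heterozygous calls at that site, which is the pattern mismapped reads from paralogous regions tend to produce.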
 
That's interesting. gnomAD doesn't list that, and the Kaviar database can't find either rs1553120233 or rs112674709. A Kaviar search on the location shows a T allele frequency of 13%, not far off the gnomAD figure of 18%, but it does NOT list an rsID (it normally does). OpenSNP has shut down (end of April 2025).

Either way the 1-145355624-C-T variant has a quality warning on gnomAD.
 
I'm not sure I understand why these warnings relate to this study that looked at new genomes.
The control data in the study comes from a small subset of a genetic database, 1000 Genomes I think. The case data comes from the sequencing used in the study.

gnomAD is saying the WGS data is not consistent across databases, indicating a possible/probable issue with the data in those databases. We don't know whether 1000 Genomes or the study's own sequencing interpreted the raw data accurately if that variant location is generally problematic.

Is it that the warnings indicate that these positions are hard to read accurately? Doesn't that depend on the sequencing tech?
Often the location in question is hard to read accurately. For example, a sequence is normally built by joining lots of little pieces of DNA data together. That can be problematic if there are a lot of repeat sequences - the pieces can get assembled wrongly. So the same issue can appear across different WGS interpretation/sequencing processes, which is one reason why it could be flagged by gnomAD.

Long read sequencing can overcome some of this issue but is expensive.
 
Nevada replication


PTPRD

That gene is covering an awful lot of ground - even just the first two things mentioned, 'cell growth and differentiation', cover an enormous number of possible things that might be going wrong in a wide range of places.

And again, there might be a problem caused by the variants, but it might be reflecting the cohorts putting themselves forward for research being special, and perhaps high-functioning, in some way. Take that last sentence: 'Variants harboured in PTRPD have also been associated with susceptibility to the de-development of neurofibrillary tangles'. 'De-development' is actually a good thing, I think - pharmaceutical companies are trying to find out how to de-develop neurofibrillary tangles in Alzheimer's.

I'm not sure, but 'variants harboured' makes it sound as though the authors may have identified these variants in their cohort.

The things we know genes do map to diseases that have funding. Every gene with a study on it is an oncogene because cancer funding is wide and deep like the Pacific ocean!
And as @jnmaciuch points out, every other gene appears to be a brain gene, because neurology has done a lot of work too.
 