Advancing regulatory variant effect prediction with AlphaGenome 2026 Avsec et al

Also found this Stanford website with info on enhancers and what has been found in other GWAS:

Think this is the preprint that introduced it:
Abstract
Identifying transcriptional enhancers and their target genes is essential for understanding gene regulation and the impact of human genetic variation on disease [1–6]. Here we create and evaluate a resource of >13 million enhancer-gene regulatory interactions across 352 cell types and tissues, by integrating predictive models, measurements of chromatin state and 3D contacts, and large-scale genetic perturbations generated by the ENCODE Consortium [7]. We first create a systematic benchmarking pipeline to compare predictive models, assembling a dataset of 10,411 element-gene pairs measured in CRISPR perturbation experiments, >30,000 fine-mapped eQTLs, and 569 fine-mapped GWAS variants linked to a likely causal gene. Using this framework, we develop a new predictive model, ENCODE-rE2G, that achieves state-of-the-art performance across multiple prediction tasks, demonstrating a strategy involving iterative perturbations and supervised machine learning to build increasingly accurate predictive models of enhancer regulation. Using the ENCODE-rE2G model, we build an encyclopedia of enhancer-gene regulatory interactions in the human genome, which reveals global properties of enhancer networks, identifies differences in the functions of genes that have more or less complex regulatory landscapes, and improves analyses to link noncoding variants to target genes and cell types for common, complex diseases. By interpreting the model, we find evidence that, beyond enhancer activity and 3D enhancer-promoter contacts, additional features guide enhancer-promoter communication including promoter class and enhancer-enhancer synergy. Altogether, these genome-wide maps of enhancer-gene regulatory interactions, benchmarking software, predictive models, and insights about enhancer function provide a valuable resource for future studies of gene regulation and human genetics.
 
Should we perhaps do the same thing for a couple of other insertions in this region? If they have a similar effect it would be quite interesting
I looked at one more deletion in another TF-bound regulatory region right next to the most significant SNP a little earlier in the thread. There’s a loss of an AP1 binding site—AP1 has been found to [edit: be relevant in a lot of immune responses including interferon/IL-1B, though it does a lot of other things], so it might not be a relevant connection.

[Edited to correct the paper link, accidentally linked one for a different protein complex also called AP1, not relevant here]

The loss of an estrogen receptor motif there could be interesting too, though unfortunately binding sites don’t necessarily tell us much about whether that TF enhances or downregulates transcription of nearby genes. The eQTL data might give a hint, but it also could be a red herring if the basal expression of the gene (what’s usually captured in eQTL data) doesn’t reflect what happens under certain functionally relevant conditions (like active infection or a shift in hormonal cycle)

When I have a chance I’ll look to see if more SNPs overlap known regulatory regions. The ENCODE cCRE track on the UCSC Genome Browser that I linked above can help with that if anyone else is curious and has time.

The only differences I see are the data source and the assay, though I don't know what polyA means here.
Oh sorry I missed this, that means polyA enrichment. Polyadenylation adds a string of adenosines (the poly(A) tail) to the end of an mRNA transcript. The tail protects the transcript from degradation and can serve as a “docking site” for other proteins involved in translation.

You can add a step to an RNA-seq procedure that pulls out transcripts with that feature, which means the experiment preferentially sequences and quantifies RNA transcripts that are more likely to be translated into proteins. This is often done because a lot of the RNA in a cell is somewhat uninteresting (like abundant ribosomal RNA), and since the RNA-seq protocol can only randomly sample a limited number of transcripts, a lot of potentially interesting but less abundant features can end up just not getting sequenced.
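To put rough numbers on the sampling point, here is a minimal Python sketch. Everything in it is a made-up illustration: the transcript counts, the sequencing depth, and the assumption that polyA enrichment removes all of the ribosomal RNA and nothing else (real protocols are not that clean).

```python
def expected_reads(abundances, depth):
    """Expected reads per transcript class if reads are sampled in
    proportion to how abundant each class is in the library."""
    total = sum(abundances.values())
    return {name: depth * count / total for name, count in abundances.items()}

# Hypothetical library composition (molecule counts), not real data.
total_rna = {"rRNA": 900_000, "abundant_mRNA": 99_900, "rare_mRNA": 100}

# Idealised polyA enrichment: assume every rRNA molecule is removed.
polya_enriched = {k: v for k, v in total_rna.items() if k != "rRNA"}

depth = 1_000_000  # hypothetical number of reads sequenced

before = expected_reads(total_rna, depth)
after = expected_reads(polya_enriched, depth)

print(f"rare_mRNA reads without enrichment: {before['rare_mRNA']:.0f}")
print(f"rare_mRNA reads with enrichment:    {after['rare_mRNA']:.0f}")
```

With these toy numbers the rare transcript gets ten times as many reads at the same sequencing depth, simply because the rRNA is no longer using up most of the reads.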
 
Should we perhaps do the same thing for a couple of other insertions in this region? If they have a similar effect it would be quite interesting.

Take for example:

20:49012496_T/TTTTG
Ref. Allele: T
P Value: 2.65 × 10^-9
If you're wondering about the AlphaGenome scores for this variant, they're pretty much the same as for other indels in the area: high scores, though with some sign flips.
chr20:49012496:T>TTTTG
RNA_SEQ
UBERON:0000955 (brain)

Gene       Quantile score
STAU1       0.9996956587
ZNFX1      -0.999494493
DDX27      -0.9993778467
CSE1L       0.9991207123
KCNB1       0.9943153262
ARFGEF2    -0.9936224222
PREX1       0.6281188726
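
If anyone wants to compare these with other indels, a quick way to look past the sign flips is to rank genes by the absolute value of the quantile score and keep the sign as a separate label. This is just a rough sketch using the numbers from the table above. The function name is my own, and I'm only labelling the sign as positive or negative rather than assuming what it means biologically.

```python
# Quantile scores copied from the table above
# (RNA_SEQ, UBERON:0000955 brain, chr20:49012496:T>TTTTG).
scores = {
    "STAU1": 0.9996956587,
    "ZNFX1": -0.999494493,
    "DDX27": -0.9993778467,
    "CSE1L": 0.9991207123,
    "KCNB1": 0.9943153262,
    "ARFGEF2": -0.9936224222,
    "PREX1": 0.6281188726,
}

def rank_by_magnitude(gene_scores):
    """Return (gene, score, sign) tuples sorted by |score|, largest first.
    Hypothetical helper for illustration, not part of AlphaGenome."""
    ranked = sorted(gene_scores.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return [(gene, s, "positive" if s > 0 else "negative") for gene, s in ranked]

ranked = rank_by_magnitude(scores)
for gene, score, sign in ranked:
    print(f"{gene:8s} {score:+.4f}  {sign}")
```

Running the same function on the scores for another indel would make it easy to see whether the same genes come out on top and which ones change sign.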

Edit: This is from the excel file in this post: https://www.s4me.info/threads/advan...lphagenome-2026-avsec-et-al.48555/post-672137
 