Advancing regulatory variant effect prediction with AlphaGenome 2026 Avsec et al

Also found this Stanford website with info on enhancers and what has been found in other GWAS:

Think this is the preprint that introduced it:
Abstract
Identifying transcriptional enhancers and their target genes is essential for understanding gene regulation and the impact of human genetic variation on disease [1–6]. Here we create and evaluate a resource of >13 million enhancer-gene regulatory interactions across 352 cell types and tissues, by integrating predictive models, measurements of chromatin state and 3D contacts, and large-scale genetic perturbations generated by the ENCODE Consortium [7]. We first create a systematic benchmarking pipeline to compare predictive models, assembling a dataset of 10,411 element-gene pairs measured in CRISPR perturbation experiments, >30,000 fine-mapped eQTLs, and 569 fine-mapped GWAS variants linked to a likely causal gene. Using this framework, we develop a new predictive model, ENCODE-rE2G, that achieves state-of-the-art performance across multiple prediction tasks, demonstrating a strategy involving iterative perturbations and supervised machine learning to build increasingly accurate predictive models of enhancer regulation. Using the ENCODE-rE2G model, we build an encyclopedia of enhancer-gene regulatory interactions in the human genome, which reveals global properties of enhancer networks, identifies differences in the functions of genes that have more or less complex regulatory landscapes, and improves analyses to link noncoding variants to target genes and cell types for common, complex diseases. By interpreting the model, we find evidence that, beyond enhancer activity and 3D enhancer-promoter contacts, additional features guide enhancer-promoter communication including promoter class and enhancer-enhancer synergy. Altogether, these genome-wide maps of enhancer-gene regulatory interactions, benchmarking software, predictive models, and insights about enhancer function provide a valuable resource for future studies of gene regulation and human genetics.
 
Should we perhaps do the same thing for a couple of other insertions in this region? If they have a similar effect it would be quite interesting
I looked at one more deletion in another TF-bound regulatory region right next to the most significant SNP a little earlier in the thread. There’s a loss of an AP1 binding site—AP1 has been found to [edit: be relevant in a lot of immune responses including interferon/IL-1B, though it does a lot of other things], so it might not be a relevant connection.

[Edited to correct the paper link, accidentally linked one for a different protein complex also called AP1, not relevant here]

The loss of an estrogen receptor motif there could be interesting too, though unfortunately binding sites don’t necessarily tell us much about whether that TF enhances or downregulates transcription of nearby genes. The eQTL data might give a hint, but it also could be a red herring if the basal expression of the gene (what’s usually captured in eQTL data) doesn’t reflect what happens under certain functionally relevant conditions (like active infection or a shift in hormonal cycle)

When I have a chance I’ll look to see if more SNPs overlap in known regulatory regions. The ENCODE cCRE track on UCSC genome browser that I linked above can help with that if anyone else is curious and has time.

The only differences I see are the data source and the assay, though I don't know what polyA means here.
Oh sorry, I missed this. That means polyA enrichment. Polyadenylation adds a string of adenosines to the end of an mRNA transcript, which protects it from degradation and can serve as a “docking site” for other proteins involved in translation.

You can add a step to an RNA-seq procedure that pulls out transcripts with that feature. It means that the experiment preferentially sequences and quantifies RNA transcripts that are more likely to be translated into proteins. This is often done because a lot of RNA codes for somewhat uninteresting things (like an abundance of ribosomal subunits), and since the RNA-seq protocol can only randomly sample a limited number of transcripts, a lot of potentially interesting but less abundant features can end up just not getting sequenced.
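To put rough numbers on that, here is a back-of-the-envelope sketch. All the figures in it (relative abundances, read depth, how much rRNA survives enrichment) are made up for illustration, and it assumes reads are sampled simply in proportion to abundance.

```python
def expected_reads(total_reads, abundances, gene):
    """Expected read count for `gene` if reads are sampled in proportion to abundance."""
    total = sum(abundances.values())
    return total_reads * abundances[gene] / total


def polya_select(abundances, non_polya, keep_fraction):
    """Scale down non-polyadenylated species to mimic polyA enrichment."""
    return {
        name: amount * keep_fraction if name in non_polya else amount
        for name, amount in abundances.items()
    }


# Hypothetical relative abundances in total RNA (illustrative only).
total_rna = {"rRNA": 900.0, "abundant_mRNA": 99.0, "rare_mRNA": 1.0}
reads = 1_000_000

before = expected_reads(reads, total_rna, "rare_mRNA")
# Assume only 1% of rRNA makes it through the enrichment step.
enriched = polya_select(total_rna, non_polya={"rRNA"}, keep_fraction=0.01)
after = expected_reads(reads, enriched, "rare_mRNA")

print(f"rare_mRNA reads without enrichment: {before:.0f}")
print(f"rare_mRNA reads with enrichment:    {after:.0f}")
```

With the same sequencing depth, the rare transcript gets several times more reads once most of the rRNA is out of the pool, which is the whole point of the enrichment step.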
 
Should we perhaps do the same thing for a couple of other insertions in this region? If they have a similar effect it would be quite interesting.

Take for example:

20:49012496_T/TTTTG
Ref. Allele: T
P Value: 2.65 × 10^-9
If you're wondering about the AlphaGenome scores for this variant, it's pretty much the same as other indels in the area, in terms of high scores (though with some sign flips).
chr20:49012496:T>TTTTG
RNA_SEQ
UBERON:0000955 (brain)

Gene      Quantile score
STAU1      0.9996956587
ZNFX1     -0.999494493
DDX27     -0.9993778467
CSE1L      0.9991207123
KCNB1      0.9943153262
ARFGEF2   -0.9936224222
PREX1      0.6281188726
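If anyone wants to sort these by strength regardless of sign, a minimal sketch is below. The scores are the ones listed above; the 0.99 cut-off and the "up"/"down" labels are my own assumptions (I'm assuming the sign reflects the predicted direction of ALT relative to REF).

```python
# Gene-level quantile scores listed above (brain, RNA_SEQ).
scores = {
    "STAU1": 0.9996956587,
    "ZNFX1": -0.999494493,
    "DDX27": -0.9993778467,
    "CSE1L": 0.9991207123,
    "KCNB1": 0.9943153262,
    "ARFGEF2": -0.9936224222,
    "PREX1": 0.6281188726,
}


def rank_by_magnitude(scores, threshold=0.99):
    """Return (gene, score, direction) with |score| >= threshold, strongest first.

    `threshold` is a hypothetical cut-off chosen for illustration.
    """
    hits = [
        (gene, score, "up" if score > 0 else "down")
        for gene, score in scores.items()
        if abs(score) >= threshold
    ]
    return sorted(hits, key=lambda hit: abs(hit[1]), reverse=True)


for gene, score, direction in rank_by_magnitude(scores):
    print(f"{gene:8s} {score:+.4f} {direction}")
```

Ranking on the absolute value keeps the sign flips visible without letting them hide a strong negative score below a weak positive one.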

Edit: This is from the excel file in this post: https://www.s4me.info/threads/advan...lphagenome-2026-avsec-et-al.48555/post-672137
 
I need to catch up with this thread but looks like an interesting discussion. I’ve been throwing some ideas around and playing with some approaches to look at lots of SNPs at once. I got loads of data to start so have been trying to be a bit smarter and narrow things down. Not sure what it all means but will try to share some scripts soon. Definitely lots more work needed and not really sure I’m up to it at the moment but maybe it’ll be useful to build upon.

When using LDLink for some LD handling I noticed some errors cropping up. Looking at LocusZoom for these variants is interesting too. What do people think is happening here? As ever, I'm unsure if it's something interesting or I've messed up somewhere!
chr22:18731211 Variant is not in 1000G reference panel (is this just no data from DecodeME or something in DecodeME but not the reference panel?)
chr22:48057352 Variant is monoallelic in the chosen population

Edit: the issues above occurred when I was accidentally skipping validation against the DecodeME QC data.
 
Looks like there were some remaining bugs which explains some of what I was seeing. And there are probably still more!

Anyway, I’ve not tested everything (particularly running from scratch without cached data or the options to replace hardcoded values I was using) and it’s a blag of borrowed (thanks @forestglip and @ME/CFS Science Blog ) and LLM generated code but the top hits are now as expected when I run here.

Summary of Identified Loci:
   CHROM     GENPOS                ID    LOG10P  CS_Size
0     20   48914387  20:48914387:T:TA  11.02190       60
1     17   52183006   17:52183006:C:T   8.67492       58
2      6   26239176    6:26239176:A:G   8.38620       60
3     15   54866724   15:54866724:A:G   8.11788      134
4      1  173846152   1:173846152:T:C   7.59142       65
5      6   97984426   6:97984426:C:CA   7.31390       88
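For anyone reading the table by hand, here is a small sketch for splitting the ID column and turning LOG10P back into a p-value. It assumes LOG10P is -log10(p), as in REGENIE-style output, and the helper names are just ones I made up for this example.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str


def parse_variant_id(variant_id):
    """Split an ID like '20:48914387:T:TA' into chromosome, position, REF and ALT."""
    chrom, pos, ref, alt = variant_id.split(":")
    if chrom.startswith("chr"):
        chrom = chrom[3:]
    return Variant(chrom, int(pos), ref, alt)


def is_indel(variant):
    """True when REF and ALT differ in length (insertion or deletion)."""
    return len(variant.ref) != len(variant.alt)


def log10p_to_p(log10p):
    """Convert LOG10P to a p-value, assuming LOG10P = -log10(p)."""
    return 10 ** (-log10p)


lead = parse_variant_id("20:48914387:T:TA")
print(lead, is_indel(lead), f"{log10p_to_p(11.02190):.2e}")
```

Note this only handles the colon-separated form used in the table, not the `T>TTTTG` or `_T/TTTTG` styles that show up elsewhere in the thread.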

So I’m going to stop now, it’s been a brain frying few days, but here is a git repo with the scripts and some results. Hope they’re of use or interest. Not sure I’ll be able to answer any questions but there’s a generated README and feel free to use/modify/fix the code.

And raw csv of the filtered results from running on decodeme are included here (this is only 8MB, unfiltered is 1.2GB of data) as well as unique genes identified.

There is still one variant I hit that is not in the 1000G EUR reference panel:
chr6:9798442
But fallback to nearest significant neighbour should handle this okish… I think?
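To make the fallback idea concrete, here is a minimal sketch of what "nearest significant neighbour" could look like. This is not the code from the repo; the function name, the 7.3 cut-off (roughly -log10(5e-8)) and the toy data are all assumptions for illustration.

```python
def nearest_panel_variant(target_pos, candidates, panel_positions, min_log10p=7.3):
    """Pick the closest significant variant that exists in the reference panel.

    candidates: iterable of (position, log10p) pairs on the same chromosome.
    panel_positions: set of positions present in the LD reference panel.
    min_log10p: hypothetical significance cut-off, roughly -log10(5e-8).
    Returns the chosen (position, log10p) pair, or None if nothing qualifies.
    """
    eligible = [
        (pos, log10p)
        for pos, log10p in candidates
        if pos != target_pos and pos in panel_positions and log10p >= min_log10p
    ]
    if not eligible:
        return None
    # Closest position wins; ties go to the more significant variant.
    return min(eligible, key=lambda c: (abs(c[0] - target_pos), -c[1]))


# Toy positions and scores, made up purely to show the behaviour.
toy_candidates = [(1000, 7.5), (1040, 8.1), (1100, 6.0), (900, 7.4)]
toy_panel = {1040, 1100, 900}
print(nearest_panel_variant(1010, toy_candidates, toy_panel))
```

One caveat: being the nearest by position does not guarantee the neighbour is in strong LD with the missing variant, so the substitute is only a rough proxy.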

There are a few hundred SNPs checked in total across the 6 significant loci (cs_size is the credible set size) and a few tissue types looked at, but these could easily be expanded. My goal was just to get some relevant-seeming data from AlphaGenome for a selection of our most interesting regions and SNPs, while doing so in a more refined way than my first pass, with a bit of sanity checking and handling of LD. Hopefully it does that and can be built upon if needed.
 
So far I've only tested the gene expression predictions (they're the only ones I really understand), but there are several other prediction types that might be useful.

The top variant in DecodeME is predicted to change expression of lots of genes, so maybe there is a single causal factor behind all those changes, which might show up with one of the other models below.

Gene Expression (RNA-seq)

Variant scores quantify the impact on overall gene transcript abundance.
  • comparison: predicted RNA coverage between REF and ALT alleles.
  • mask: exons for a gene of interest.
  • aggregation: Log-fold change of gene expression level between the ALT and REF alleles.
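The formula itself didn't survive the copy-paste, so here is only a rough sketch of how I read the description: average the predicted coverage over the gene's exons for each allele, then take the log of the ratio. The natural log, the mean (rather than sum) and the small pseudocount are my assumptions, not something taken from the AlphaGenome docs.

```python
import math


def gene_expression_lfc(ref_cov, alt_cov, exon_mask, pseudocount=1e-3):
    """Log-fold change of mean exon coverage, ALT relative to REF.

    ref_cov, alt_cov: per-base predicted RNA coverage for each allele.
    exon_mask: booleans marking positions inside the gene's exons.
    pseudocount: assumed small constant to avoid log(0).
    """
    ref_vals = [c for c, m in zip(ref_cov, exon_mask) if m]
    alt_vals = [c for c, m in zip(alt_cov, exon_mask) if m]
    if not ref_vals:
        raise ValueError("exon mask selects no positions")
    ref_mean = sum(ref_vals) / len(ref_vals)
    alt_mean = sum(alt_vals) / len(alt_vals)
    return math.log(alt_mean + pseudocount) - math.log(ref_mean + pseudocount)


# Toy tracks: coverage doubles inside the exons, changes outside are ignored.
toy_ref = [0.0, 2.0, 4.0, 0.0]
toy_alt = [0.0, 4.0, 8.0, 5.0]
toy_mask = [False, True, True, False]
print(gene_expression_lfc(toy_ref, toy_alt, toy_mask))
```

A positive value means the ALT allele is predicted to raise expression of that gene, a negative value means it lowers it.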

Polyadenylation Site (PAS) Usage

This follows Borzoi’s [Linder et al., 2025] methodology for scoring polyadenylation quantitative trait loci (paQTLs), which captures the variant’s impact on RNA isoform production.
  • comparison: predicted RNA coverage between REF and ALT alleles.
  • mask: local 400-bp windows around 3’ cleavage junctions.
  • aggregation: Maximum absolute log-fold change of isoform ratios (distal/proximal PAS usage) between REF and ALT, considering all proximal/distal splits.
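The "all proximal/distal splits" part took me a while to parse, so here is a sketch of how I understand it. It assumes you already have one coverage value per PAS for each allele, ordered from most proximal to most distal; the log base and the pseudocount are my own choices.

```python
import math


def pas_usage_score(ref_pas, alt_pas, pseudocount=1e-3):
    """Max |log2 fold change| of the distal/proximal ratio over all split points.

    ref_pas, alt_pas: coverage per polyadenylation site, ordered proximal to distal.
    pseudocount: assumed small constant to keep ratios finite.
    """
    if len(ref_pas) != len(alt_pas) or len(ref_pas) < 2:
        raise ValueError("need at least two PAS and equal-length inputs")
    best = 0.0
    for split in range(1, len(ref_pas)):
        # Sites before `split` count as proximal, the rest as distal.
        ref_ratio = (sum(ref_pas[split:]) + pseudocount) / (sum(ref_pas[:split]) + pseudocount)
        alt_ratio = (sum(alt_pas[split:]) + pseudocount) / (sum(alt_pas[:split]) + pseudocount)
        best = max(best, abs(math.log2(alt_ratio) - math.log2(ref_ratio)))
    return best


# Toy example: the ALT allele doubles usage of the distal site.
print(pas_usage_score([10.0, 10.0], [10.0, 20.0]))
```

Because it takes the absolute value and the maximum, the score says how strongly isoform usage shifts but not in which direction.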

TSS Activity (CAGE, PRO-cap)

Variant scores quantify local changes at TSSs.
  • comparison: predicted CAGE or PRO-cap coverage between REF and ALT alleles.
  • mask: local 501-bp window centered at the variant.
  • aggregation: Log2-ratio of summed signals.

Chromatin Accessibility (ATAC-seq, DNase-seq)

Variant scores quantify local accessibility changes.
  • comparison: predicted ATAC-seq or DNase-seq coverage between REF and ALT alleles.
  • mask: local 501-bp window centered at the variant.
  • aggregation: Log2-ratio of summed signals.

Transcription Factor Binding (ChIP-TF)

Variant scores quantify changes in TF binding intensity.
  • comparison: predicted ChIP-TF coverage between REF and ALT alleles.
  • mask: local 501-bp window centered at the variant.
  • aggregation: Log2-ratio of summed signals.

Histone Modifications (ChIP-Histone)

Variant scores quantify changes in histone modifications.
  • comparison: predicted ChIP-Histone coverage between REF and ALT alleles.
  • mask: local 2001-bp window centered at the variant.
  • aggregation: Log2-ratio of summed signals.
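The four window-based scorers above (TSS activity, accessibility, TF binding, histone marks) all seem to have the same shape, just with a different track and window width. A minimal sketch of that shape follows; the pseudocount of 1 is an assumption on my part, since the exact formula didn't come through in the text.

```python
import math


def center_window_log2_ratio(ref_track, alt_track, variant_index, width=501, pseudocount=1.0):
    """Log2-ratio of summed signal in a window centered on the variant.

    ref_track, alt_track: per-base predicted signal for each allele.
    variant_index: position of the variant within the tracks.
    width: window size in bp (501 for most tracks, 2001 for histone marks).
    pseudocount: assumed constant added to each sum to avoid log2(0).
    """
    half = width // 2
    start = max(0, variant_index - half)
    end = min(len(ref_track), variant_index + half + 1)
    ref_sum = sum(ref_track[start:end])
    alt_sum = sum(alt_track[start:end])
    return math.log2(alt_sum + pseudocount) - math.log2(ref_sum + pseudocount)


# Toy tracks: flat REF signal, ALT gains signal right at the variant.
toy_ref_track = [1.0] * 1000
toy_alt_track = list(toy_ref_track)
toy_alt_track[500] += 502.0
print(center_window_log2_ratio(toy_ref_track, toy_alt_track, variant_index=500))
```

Anything the variant changes outside the window is ignored, which is why these scores are described as local.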

Splicing (Splice Sites)

Variant scores quantify changes in the class assignment probabilities (acceptor, donor) at all potential splice sites within a gene body.
  • comparison: class assignment probabilities for REF and ALT alleles.
  • mask: gene body for a gene of interest.
  • aggregation: Maximum absolute difference of predicted splice site probabilities across the gene body.

Splicing (Splice Site Usage)

Variant scores quantify changes in the usage of splice sites (i.e., increased or decreased fractions).
  • comparison: predicted splice site usage between REF and ALT alleles.
  • mask: gene body for a gene of interest.
  • aggregation: Maximum absolute difference of predicted splice site usage across the gene body.
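Both the splice site and splice site usage scores boil down to "largest change anywhere in the gene". A tiny sketch, assuming one value per position for each allele and a boolean mask for the gene body (the function name is mine):

```python
def max_abs_difference(ref_values, alt_values, gene_mask):
    """Largest |ALT - REF| over positions inside the gene body.

    ref_values, alt_values: per-position splice site probability or usage.
    gene_mask: booleans marking positions inside the gene of interest.
    Returns 0.0 if the mask selects nothing.
    """
    diffs = [
        abs(alt - ref)
        for ref, alt, inside in zip(ref_values, alt_values, gene_mask)
        if inside
    ]
    return max(diffs, default=0.0)


# Toy values: the biggest change (0.7) lies outside the gene and is ignored.
toy_ref_sites = [0.1, 0.9, 0.2]
toy_alt_sites = [0.1, 0.4, 0.9]
toy_gene_mask = [True, True, False]
print(max_abs_difference(toy_ref_sites, toy_alt_sites, toy_gene_mask))
```

Since only the single largest change counts, one strongly affected splice site gives the same score as many, and the sign of the change is lost.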

Splicing (Splice Junctions)

Variant scores quantify changes in the predicted RNA-seq reads spanning a junction, which is a function of expression level, splice site usage, and splicing efficiency.
  • comparison: predicted paired junction counts between REF and ALT alleles.
  • mask: top-k splice sites for a gene of interest (including annotated and predicted splice sites).
  • aggregation: Maximum absolute log-fold change of predicted junction counts across splice site pairs of interest.

3D Genome Contact (Contact Maps)

Variant scores quantify local contact disruption.
  • comparison: predicted contact frequencies between REF and ALT alleles.
  • mask: local 1-Mb window centered at the variant.
  • aggregation: Mean absolute difference of contact frequencies, for all interactions involving the variant-containing bin.
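For the contact map score, my reading is that you only look at the row of the map belonging to the bin that contains the variant. A minimal sketch under that reading, assuming square, symmetric maps given as nested lists that already cover just the window of interest:

```python
def contact_disruption(ref_map, alt_map, variant_bin):
    """Mean |ALT - REF| contact frequency for interactions involving one bin.

    ref_map, alt_map: square contact matrices (lists of rows) for each allele.
    variant_bin: index of the bin that contains the variant.
    Assumes symmetric maps, so the bin's row covers all its interactions.
    """
    ref_row = ref_map[variant_bin]
    alt_row = alt_map[variant_bin]
    diffs = [abs(alt - ref) for ref, alt in zip(ref_row, alt_row)]
    return sum(diffs) / len(diffs)


# Toy 3x3 maps: only the contact between bins 1 and 2 changes.
toy_ref_map = [[1.0, 2.0, 3.0], [2.0, 1.0, 4.0], [3.0, 4.0, 1.0]]
toy_alt_map = [[1.0, 2.0, 3.0], [2.0, 1.0, 6.0], [3.0, 6.0, 1.0]]
print(contact_disruption(toy_ref_map, toy_alt_map, variant_bin=1))
```

Like the splicing scores, this is unsigned, so it tells you that contacts changed but not whether they were gained or lost.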
 