Advancing regulatory variant effect prediction with AlphaGenome 2026 Avsec et al

Also found this Stanford website with info on enhancers and what has been found in other GWAS:

Think this is the preprint that introduced it:
Abstract
Identifying transcriptional enhancers and their target genes is essential for understanding gene regulation and the impact of human genetic variation on disease1–6. Here we create and evaluate a resource of >13 million enhancer-gene regulatory interactions across 352 cell types and tissues, by integrating predictive models, measurements of chromatin state and 3D contacts, and large-scale genetic perturbations generated by the ENCODE Consortium7. We first create a systematic benchmarking pipeline to compare predictive models, assembling a dataset of 10,411 element-gene pairs measured in CRISPR perturbation experiments, >30,000 fine-mapped eQTLs, and 569 fine-mapped GWAS variants linked to a likely causal gene. Using this framework, we develop a new predictive model, ENCODE-rE2G, that achieves state-of-the-art performance across multiple prediction tasks, demonstrating a strategy involving iterative perturbations and supervised machine learning to build increasingly accurate predictive models of enhancer regulation. Using the ENCODE-rE2G model, we build an encyclopedia of enhancer-gene regulatory interactions in the human genome, which reveals global properties of enhancer networks, identifies differences in the functions of genes that have more or less complex regulatory landscapes, and improves analyses to link noncoding variants to target genes and cell types for common, complex diseases. By interpreting the model, we find evidence that, beyond enhancer activity and 3D enhancer-promoter contacts, additional features guide enhancer-promoter communication including promoter class and enhancer-enhancer synergy. Altogether, these genome-wide maps of enhancer-gene regulatory interactions, benchmarking software, predictive models, and insights about enhancer function provide a valuable resource for future studies of gene regulation and human genetics.
 
Should we perhaps do the same thing for a couple of other insertions in this region? If they have a similar effect it would be quite interesting
I looked at one more deletion in another TF-bound regulatory region right next to the most significant SNP a little earlier in the thread. There’s a loss of an AP1 binding site—AP1 has been found to [edit: be relevant in a lot of immune responses including interferon/IL-1B, though it does a lot of other things], so the connection might not be specific to immune function.

[Edited to correct the paper link, accidentally linked one for a different protein complex also called AP1, not relevant here]

The loss of an estrogen receptor motif there could be interesting too, though unfortunately binding sites don’t necessarily tell us much about whether that TF enhances or downregulates transcription of nearby genes. The eQTL data might give a hint, but it also could be a red herring if the basal expression of the gene (what’s usually captured in eQTL data) doesn’t reflect what happens under certain functionally relevant conditions (like active infection or a shift in hormonal cycle)

When I have a chance I’ll look to see if more SNPs overlap in known regulatory regions. The ENCODE cCRE track on UCSC genome browser that I linked above can help with that if anyone else is curious and has time.

The only differences I see are the data source and the assay, though I don't know what polyA means here.
Oh sorry, I missed this: that means polyA enrichment. Polyadenylation is a string of adenosines at the end of an mRNA transcript that prevents degradation and can serve as a “docking site” for other proteins involved in translation.

You can add a step to an RNA-seq procedure that pulls out transcripts with that feature—it means that the experiment is preferentially sequencing and quantifying RNA transcripts that are more likely to be translated into proteins. This is often done because a lot of RNA codes for somewhat uninteresting things (like an abundance of ribosomal subunits), and since the RNA-seq protocol can only randomly sample a limited number of transcripts, a lot of potentially interesting but less abundant transcripts can end up just not getting sequenced.
 
Should we perhaps do the same thing for a couple of other insertions in this region? If they have a similar effect it would be quite interesting.

Take for example:

20:49012496_T/TTTTG
Ref. Allele: T
P Value: 2.65 × 10^-9
If you're wondering about the AlphaGenome scores for this variant, it's pretty much the same as other indels in the area, in terms of high scores (though with some sign flips).
chr20:49012496:T>TTTTG
RNA_SEQ
UBERON:0000955 (brain)
Gene     Quantile score
STAU1     0.9996956587
ZNFX1    -0.999494493
DDX27    -0.9993778467
CSE1L     0.9991207123
KCNB1     0.9943153262
ARFGEF2  -0.9936224222
PREX1     0.6281188726

Edit: This is from the excel file in this post: https://www.s4me.info/threads/advan...lphagenome-2026-avsec-et-al.48555/post-672137
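For anyone who wants to slice the exported scores the same way, here is a minimal pandas sketch using the values quoted above. The column names and the idea of loading from the attached excel file with pd.read_excel() are my assumptions, not necessarily how the file is actually laid out:

```python
import pandas as pd

# Mock rows mirroring the scores quoted above; in practice you'd load the
# attached excel file with pd.read_excel() (column names are my guess).
scores = pd.DataFrame({
    "gene": ["STAU1", "ZNFX1", "DDX27", "CSE1L", "KCNB1", "ARFGEF2", "PREX1"],
    "quantile_score": [0.9996956587, -0.999494493, -0.9993778467,
                       0.9991207123, 0.9943153262, -0.9936224222, 0.6281188726],
})

# Keep only genes the model scores confidently in either direction, then
# order by absolute score so sign flips don't hide strong effects.
strong = scores[scores["quantile_score"].abs() > 0.9]
strong = strong.reindex(strong["quantile_score"].abs().sort_values(ascending=False).index)
print(strong["gene"].tolist())
```

Filtering on the absolute value is the important bit, since (as noted above) the indels in this region score high in magnitude but with some sign flips between genes.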
 
I need to catch up with this thread but looks like an interesting discussion. I’ve been throwing some ideas around and playing with some approaches to look at lots of SNPs at once. I got loads of data to start so have been trying to be a bit smarter and narrow things down. Not sure what it all means but will try to share some scripts soon. Definitely lots more work needed and not really sure I’m up to it at the moment but maybe it’ll be useful to build upon.

When looking at LDLink to do some LD handling I noticed some errors cropping up, looking at LocusZoom for these is interesting too, what do people think is happening here? As ever, unsure if it’s something interesting or I’ve messed up somewhere!
chr22:18731211 Variant is not in 1000G reference panel (is this just no data from DecodeME or something in DecodeME but not the reference panel?)
chr22:48057352 Variant is monoallelic in the chosen population

Edit: so the above issues were when I was accidentally skipping validation against the DecodeME QC data
 
Looks like there were some remaining bugs which explains some of what I was seeing. And there are probably still more!

Anyway, I’ve not tested everything (particularly running from scratch without cached data, or the options to replace hardcoded values I was using), and it’s a grab bag of borrowed (thanks @forestglip and @ME/CFS Science Blog ) and LLM-generated code, but the top hits are now as expected when I run it here.

Summary of Identified Loci:
CHROM GENPOS ID LOG10P CS_Size
0 20 48914387 20:48914387:T:TA 11.02190 60
1 17 52183006 17:52183006:C:T 8.67492 58
2 6 26239176 6:26239176:A:G 8.38620 60
3 15 54866724 15:54866724:A:G 8.11788 134
4 1 173846152 1:173846152:T:C 7.59142 65
5 6 97984426 6:97984426:C:CA 7.31390 88

So I’m going to stop now, it’s been a brain frying few days, but here is a git repo with the scripts and some results. Hope they’re of use or interest. Not sure I’ll be able to answer any questions but there’s a generated README and feel free to use/modify/fix the code.

And a raw CSV of the filtered results from running on DecodeME is included here (this is only 8MB; unfiltered is 1.2GB of data), as well as the unique genes identified.

There is still a variant I hit not in the 1000G EUR reference panel
chr6:97984426
But fallback to nearest significant neighbour should handle this okish… I think?
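I don’t know exactly how the script implements the fallback, but the idea is something like this sketch (function name and the toy neighbours are hypothetical, only the chr6 position comes from the table above):

```python
def nearest_significant_neighbour(target_pos, snps, p_threshold=5e-8):
    """Pick the closest genome-wide-significant SNP to use as an LD proxy
    when the target variant is missing from the 1000G reference panel.

    `snps` is a list of (position, pvalue) tuples on the same chromosome.
    Returns the proxy position, or None if nothing passes the threshold.
    """
    candidates = [(pos, p) for pos, p in snps if p < p_threshold]
    if not candidates:
        return None
    # Closest by base-pair distance to the missing variant.
    return min(candidates, key=lambda c: abs(c[0] - target_pos))[0]

# Toy neighbours around the chr6 locus; the target itself is absent,
# mimicking the "not in reference panel" case.
snps = [(97984000, 1e-9), (97990000, 0.2)]
print(nearest_significant_neighbour(97984426, snps))
```

The obvious caveat is that base-pair distance is only a crude stand-in for LD, so the proxy can still tag a different signal.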

There are a few hundred SNPs checked in total across the 6 significant loci (cs_size is the credible set size) and a few tissue types looked at, but these could easily be expanded. My goal was just to get some relevant-seeming data from AlphaGenome for a selection of our most interesting regions and SNPs, while doing so in a more refined way than my first pass, with a bit of sanity checking and handling of LD. Hopefully it does that and can be built upon if needed.
 
Posts moved from: Genetics: Chromosome 20: ARFGEF2, CSE1L, STAU1


I think we need a formal seminar in plain Anglo Saxon to tell us what has been discovered.

There's a lot more for me to read about it to understand, but it's basically a machine learning model trained with lots of genetic and other data to look at a string of DNA (up to 1 million base pairs) and predict various things like how much RNA will be expressed all along the strand you gave it.

The neat thing is that you can give it two strands of DNA, one of which has a variant switched to a different allele, and get two predictions, then see what the model thinks will change about gene expression given the change in one variant. If it's confident that there will be an increase of expression in a specific gene, or change in chromatin accessibility related to that gene, or something else, it might act as supporting evidence for that gene being the target of the variant. (It seems they also talk about using it for general foundational understanding of DNA, not just for seeing changes due to variants.)
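Mechanically, the two-prediction comparison described above boils down to something like this (the numbers are mock coverage values, not real model output):

```python
import numpy as np

# Pretend these are the model's predicted RNA-seq coverage tracks over the
# same window for the reference and alternate alleles (one value per bin).
ref_coverage = np.array([1.0, 2.0, 8.0, 2.0, 1.0])
alt_coverage = np.array([1.0, 2.0, 4.0, 2.0, 1.0])

# The variant effect is just the difference between the two predictions;
# here the middle bin (say, an exon of a nearby gene) drops by half.
delta = alt_coverage - ref_coverage

# A single summary number: log2-ratio of total signal, negative because
# the ALT allele is predicted to lower expression overall.
log2_fc = np.log2(alt_coverage.sum() / ref_coverage.sum())
print(delta, round(log2_fc, 3))
```

If the model is confident, the delta is concentrated where it matters (an exon, a promoter), rather than spread as noise across the window.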

The thing is, I'm not sure how useful it is if you don't know which specific variant is causal. Lots of variants are likely to do lots of things, so if you ask it what hundreds of variants do, I think you'll get too much data back to make sense of.

Maybe when we have it narrowed down much further to specific variants, it might be more useful.

@jnmaciuch was posting in that thread about lots of transcription factors binding at the site of many of the most significant variants in chr20, but I'm not sure AlphaGenome was really needed for that.
 
but I'm not sure AlphaGenome was really needed for that.
Nope that’s just good ole fashioned grunt work. Mostly I was trying to figure out if the fact that alphagenome predicted so many genes being strongly affected by one variant was biologically plausible, or if it can be written off as the model hallucinating. If the variants overlap TF binding regions (or are in strong LD with other variants that do), I think it is biologically plausible.
It seems they also talk about using it for general foundational understanding of DNA, not just for seeing changes due to variants
I have a sneaking suspicion that the model isn't learning novel patterns of genome regulation so much as it is just a fancy summarizer of eQTL data with extra steps. Being able to assess that would require coming up with some creative tests though. For the purposes of understanding DecodeME data I suspect it’s not really an important distinction.
 
The gene expression predictions were the only ones I tested so far (because that's the only ones I really understand what they are), but there are several other prediction types that might be useful.

Maybe while the top variant in DecodeME predicts changes in expression in lots of genes, there could be a singular causal factor for so many genes to change which might be seen with one of the following other models.

Gene Expression (RNA-seq)

Variant scores quantify the impact on overall gene transcript abundance.
  • comparison: predicted RNA coverage between REF and ALT alleles.
  • mask: exons for a gene of interest.
  • aggregation: Log-fold change of gene expression level between the ALT and REF alleles.

Polyadenylation Site (PAS) Usage

This follows Borzoi’s [Linder et al., 2025] methodology for scoring polyadenylation quantitative trait loci (paQTLs), which captures the variant’s impact on RNA isoform production.
  • comparison: predicted RNA coverage between REF and ALT alleles.
  • mask: local 400-bp windows around 3’ cleavage junctions.
  • aggregation: Maximum absolute log-fold change of isoform ratios (distal/proximal PAS usage) between REF and ALT, considering all proximal/distal splits.

TSS Activity (CAGE, PRO-cap)

Variant scores quantify local changes at TSSs.
  • comparison: predicted CAGE or PRO-cap coverage between REF and ALT alleles.
  • mask: local 501-bp window centered at the variant.
  • aggregation: Log2-ratio of summed signals.

Chromatin Accessibility (ATAC-seq, DNase-seq)

Variant scores quantify local accessibility changes.
  • comparison: predicted ATAC-seq or DNase-seq coverage between REF and ALT alleles.
  • mask: local 501-bp window centered at the variant.
  • aggregation: Log2-ratio of summed signals.

Transcription Factor Binding (ChIP-TF)

Variant scores quantify changes in TF binding intensity.
  • comparison: predicted ChIP-TF coverage between REF and ALT alleles.
  • mask: local 501-bp window centered at the variant.
  • aggregation: Log2-ratio of summed signals.

Histone Modifications (ChIP-Histone)

Variant scores quantify changes in histone modifications.
  • comparison: predicted ChIP-Histone coverage between REF and ALT alleles.
  • mask: local 2001-bp window centered at the variant.
  • aggregation: Log2-ratio of summed signals.

Splicing (Splice Sites)

Variant scores quantify changes in the class assignment probabilities (acceptor, donor) at all potential splice sites within a gene body.
  • comparison: class assignment probabilities for REF and ALT alleles.
  • mask: gene body for a gene of interest.
  • aggregation: Maximum absolute difference of predicted splice site probabilities across the gene body.

Splicing (Splice Site Usage)

Variant scores quantify changes in the usage of splice sites (i.e., increased or decreased fractions).
  • comparison: predicted splice site usage between REF and ALT alleles.
  • mask: gene body for a gene of interest.
  • aggregation: Maximum absolute difference of predicted splice site usage across the gene body.

Splicing (Splice Junctions)

Variant scores quantify changes in the predicted RNA-seq reads spanning a junction, which is a function of expression level, splice site usage, and splicing efficiency.
  • comparison: predicted paired junction counts between REF and ALT alleles.
  • mask: top-k splice sites for a gene of interest (including annotated and predicted splice sites).
  • aggregation: Maximum absolute log-fold change of predicted junction counts across splice site pairs of interest.

3D Genome Contact (Contact Maps)

Variant scores quantify local contact disruption.
  • comparison: predicted contact frequencies between REF and ALT alleles.
  • mask: local 1 Mb window centered at the variant.
  • aggregation: Mean absolute difference of contact frequencies, for all interactions involving the variant-containing bin.
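The two aggregation patterns that recur in these descriptions (log2-ratio of summed signals within a mask, and maximum absolute difference within a mask) are simple enough to sketch with mock tracks. This is only my reading of the descriptions above, not AlphaGenome's actual implementation; in particular the epsilon guard is my addition:

```python
import numpy as np

def log2_sum_ratio(ref, alt, mask, eps=1e-6):
    """Log2-ratio of summed REF vs ALT signal within a mask (the recurring
    'aggregation' for accessibility, TSS, and ChIP scores). The eps term
    guards against division by zero and is my assumption."""
    return np.log2((alt[mask].sum() + eps) / (ref[mask].sum() + eps))

def max_abs_diff(ref, alt, mask):
    """Maximum absolute REF/ALT difference within a mask, as in the
    splice-site aggregations."""
    return np.abs(alt[mask] - ref[mask]).max()

# Toy tracks: 10 bins, with the mask standing in for the "local 501-bp
# window centered at the variant".
ref = np.array([0.1, 0.1, 0.2, 1.0, 2.0, 2.0, 1.0, 0.2, 0.1, 0.1])
alt = np.array([0.1, 0.1, 0.2, 1.0, 4.0, 4.0, 1.0, 0.2, 0.1, 0.1])
mask = np.zeros(10, dtype=bool)
mask[3:7] = True

print(round(float(log2_sum_ratio(ref, alt, mask)), 3))
print(float(max_abs_diff(ref, alt, mask)))
```

Note how the masks do most of the work: the same REF/ALT tracks can yield an exon-level expression score, a promoter-window score, or a splice-site score depending on which bins are kept.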
 
I have a sneaking suspicion that the model isn't learning novel patterns of genome regulation so much as it is just a fancy summarizer of eQTL data with extra steps.
I'm not sure if this is what you mean, but I was initially also thinking that it was basically outputting the same things you'd find in GTEx. But I don't think it's that.

I'm pretty sure it doesn't "know" where in the genome the strand of nucleotides you give it is, or where any genes are. I think the training was just giving it unlabeled strands of DNA, and making it predict how much each nucleotide along that DNA is expected to be expressed - all just based on patterns of nucleotides.

So when it says "high expression in so-and-so region", that's just based on its training telling it that when it sees some nucleotides show up in a certain order, we should expect to see high expression in a certain area. Matching up actual gene names to the output of the model is done after the prediction. It's basing predictions on having learned patterns from all across the 3 billion base pairs of DNA from many individuals, not just the specific region of interest.

So this means it could predict the consequences of variants that have never shown up in an eQTL database. I think you could probably even give it DNA from some random animal that it wasn't trained on, and it would be able to predict where and how much genes are being expressed (though maybe not as well).

I probably wouldn't have thought much about this if it was from a research group I hadn't heard of, but Google's Alpha models (e.g. AlphaFold, AlphaGo, AlphaZero) seem to have a reputation for being the cutting edge of machine learning.
 
So this means it could predict the consequences of variants that have never shown up in an eQTL database.
Yes that’s the promise. The issue I’m worried about is whether it actually can deliver. You’re inputting strings of nucleotides and getting out eQTL predictions (with the gene names attributed after the fact), but that was also the format for the training data: input sequences with known eQTL output (translated such that the algorithm didn’t “know” what genes they belonged to, but it is still possible to learn the associations between input sequences and quantification of output sequences). If your query overlaps anything in the training data just at the level of sequence, things will probably look good.
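That sequence-level overlap could in principle be probed with something as crude as k-mer sharing between a query and the training sequences. This is purely illustrative (toy sequences, not a rigorous leakage test), but it shows the kind of check that would flag a query as "effectively already seen":

```python
def kmer_set(seq, k=8):
    """All k-length substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity between two k-mer sets."""
    return len(a & b) / len(a | b)

train = "ACGTACGTTTGACCAGTTGACCA" * 4   # stand-in for training sequence
query_seen = train[10:50]                # a substring of the "training" data
query_novel = "TTTTGGGGCCCCAAAA" * 3     # unrelated sequence

sim_seen = jaccard(kmer_set(train), kmer_set(query_seen))
sim_novel = jaccard(kmer_set(train), kmer_set(query_novel))
print(sim_seen > sim_novel)  # the seen query shares far more 8-mers
```

A high similarity wouldn't prove the model is just regurgitating, but predictions on low-similarity queries are where the "generalization vs. summarization" question actually gets tested.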

though maybe not as well
That’s pretty much the gist of it, and what I’m seeing with various in-silico models that I’m testing for my thesis. Fantastic performance on training data (obviously), somewhat decent performance on something with very high similarity to an instance that’s already in the training data, and any degree farther…..disappointment. You make a model hoping that it’s learning generalizable patterns. The reality tends to be a lot of hype that dissolves away after a few months. So it really just ends up being a glorified summarization tool of the training data more than anything else.

The real assessment would come from having experimental data confirming the predicted results of something several degrees different from what is already seen in the training data. I’m sure they’ve already tried to do a few of those test cases for show. I tend to hold out until other independent teams with no stake in the tool’s success can show the same thing.

I probably wouldn't have thought much about this if it was from a research group I hadn't heard of, but Google's Alpha models (e.g. AlphaFold, AlphaGo, AlphaZero) seem to have a reputation for being the cutting edge of machine learning.
Predicting protein folding is somewhat of a different case because the possible inputs and possible outputs are much more constrained, and even then I’ve heard many a grad student complaining about predictions that made no sense. I’m sure this team is more rigorous than others in the AI space, and I’d be happy if these tools stand the test of time. I guess I’ve just learned my lesson re:AI hype one too many times.
 
I also put together a visualisation tool to go with my earlier script, on the dev branch, that can be run locally and accessed through a browser. I can guarantee this will be full of incorrect scientific assumptions, behaviour which mischaracterises data and misunderstandings of APIs as well as bugs!

It’s not meant to be anything more than something for experimenting with, a toy, something to help me/us understand the tools more than the data, despite the slightly puffed-up README (I got Gemini to generate these based upon the code and chat history while making this). But you can select SNPs and tissues and generate plots from AlphaGenome nice and quickly. So again, maybe something to pick up and build upon later if/when we understand things more.
 