What Can Epigenomic Foundation Models Reveal About Biology?
Welcome to the first in a series of posts from Epigenome Technologies on applying foundation model techniques (or "deep learning") to epigenetic datasets. The series aims to offer a variety of perspectives on this area of research, including overviews of major models, potential real-world applications, worked examples of fine-tuning, and deep dives into technical underpinnings. This first post centers on sequence-to-function models, which predict molecular outcomes from DNA sequence alone: Evo-2, AlphaGenome, Borzoi, and the GENA-LM suite of models. A subsequent post will concentrate on function-to-function models, which predict (or impute) molecular states from other, related molecular states.
Cells maintain protein levels on timescales far exceeding protein half-lives to sustain cellular function; under pathological conditions, however, protein levels drift into dysfunction or even malignancy. Both maintenance and drift are rooted in the nucleus, where the mRNA encoding the next generation of proteins is produced. The epigenetic landscape coordinates the activity of RNA polymerase and its cofactors and therefore serves as a feedback nexus, responsible for environmental buffering, for memory, and for entrapping cells in malfunctioning states. Indeed, epigenetic predisposition determines the efficacy of reprogramming somatic cells to pluripotency: modern protocols add epigenetic drugs (e.g., EZH2, HDAC, or DOT1L inhibitors) alongside signaling factors to (de-)differentiation-inducing media, which shows that chromatin state can stymie even "robust" somatic cell reprogramming with factor cocktails such as the classic "OSKM" or "OSNL" combinations.
Understanding the fine details of homeostasis, environmental perturbation, disease onset, and disease progression will require single-cell-resolution data and models that can generalize despite high complexity and partial observability. Given the physical nature of the nucleus (nearly 2 meters of linear DNA packed into a 4-micrometer radius), only molecular methods such as Paired-Tag and single-cell (sc)CUT&Tag can generate the necessary high-resolution data (at least for the time being!). Foundation models that extrapolate from next-generation sequencing-type data are therefore of paramount importance for understanding the long-term dynamics of cellular processes.
Sequence-to-Function Models
The sequence-to-function model, the most straightforward (though far from simple) type of genomic foundation model, uses DNA sequence alone to predict functional outcomes such as open chromatin, histone modifications, gene expression, DNA methylation, transcription factor binding, and splicing. Because all cell types in an organism share effectively the same DNA sequence while these functional outcomes differ by cell type, every sequence-to-function mapping is fan-out (i.e., single-input, multi-output): one sequence maps to a separate prediction for each cell type and assay.
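The fan-out shape of this mapping can be illustrated with a toy sketch. Nothing here comes from any of the models discussed in this post: the track labels, the random per-base weights, and the function names are all hypothetical, chosen only to show one sequence going in and one signal per cell type and assay coming out.

```python
import random

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as rows of 4 floats (A, C, G, T); any other symbol, e.g. N, becomes all zeros."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]

def make_toy_model(tracks, seed=0):
    """Toy stand-in for a trained network: one random weight per base for each output track."""
    rng = random.Random(seed)
    return {track: [rng.uniform(-1.0, 1.0) for _ in BASES] for track in tracks}

def predict(model, seq):
    """Fan-out: a single input sequence yields a per-position signal for every track."""
    encoded = one_hot(seq)
    return {
        track: [sum(w * x for w, x in zip(weights, row)) for row in encoded]
        for track, weights in model.items()
    }

# Hypothetical assay:cell-type labels, for illustration only.
tracks = ["ATAC:hepatocyte", "ATAC:neuron", "RNA:hepatocyte"]
model = make_toy_model(tracks)
out = predict(model, "ACGTN")
```

The input is identical for every track; only the output head differs. Real models replace the per-base weights with deep convolutional or attention layers over long sequence windows, but the single-input, multi-output contract is the same.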
This post explores the details of three sequence-to-function foundation models (AlphaGenome, Evo-2, and Borzoi) and one earlier model (GENA-LM), examining their components to compare inputs, outputs, latent representations, architectures, and training methodologies.
Variant effect prediction is the core application of sequence-to-function models, in particular predicting the impact of non-coding variants; derived applications include plasmid or promoter sequence optimization, quantitative trait locus prioritization, and de novo sequence generation. Nevertheless, each foundation model takes a distinct approach to the task of function prediction.
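One common way to frame variant effect prediction with a sequence-to-function model is to score the reference and alternate sequences and take the difference. The sketch below assumes that framing; `toy_score` is a hypothetical stand-in (a simple motif counter) for a trained model, and the sequence and motif are illustrative only.

```python
def apply_snv(seq, pos, alt):
    """Return seq with the base at 0-based position pos replaced by alt."""
    if alt not in "ACGT":
        raise ValueError(f"unsupported alternate base: {alt!r}")
    if not 0 <= pos < len(seq):
        raise IndexError("variant position lies outside the sequence")
    return seq[:pos] + alt + seq[pos + 1:]

def toy_score(seq, motif="TATAAA"):
    """Hypothetical stand-in for a trained model: counts occurrences of a motif."""
    return float(sum(seq.startswith(motif, i) for i in range(len(seq) - len(motif) + 1)))

def variant_effect(score_fn, ref_seq, pos, alt):
    """Predicted effect of a variant = score(alternate sequence) - score(reference sequence)."""
    return score_fn(apply_snv(ref_seq, pos, alt)) - score_fn(ref_seq)

ref = "GGCTATAAAGGC"
disrupting = variant_effect(toy_score, ref, 5, "G")  # breaks the motif
neutral = variant_effect(toy_score, ref, 0, "A")     # falls outside the motif
```

Because `score_fn` is passed in as an argument, the same comparison would apply to any model that maps a sequence to a number, such as predicted expression of a nearby gene in a given tissue.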
Overview
"Predict what a given DNA sequence does" describes the general problem these models address. A fully fledged "virtual cell" that provides phenotypic outputs (e.g., proliferation or secretion) might be the most helpful tool; these models, however, aim for an intermediate step: predicting molecular genetic outcomes such as RNA expression, DNA accessibility, chromatin state, or transcription factor binding.
These recently published models (2024/2025) each emphasize different main results. GENA-LM highlights an expanded input size compared to prior models (DNABERT/BigBird), strong performance on histone and transcription factor occupancy, and the incorporation of recurrent memory transformers to boost species classification performance. Borzoi predicts gene expression and open chromatin in a tissue- and species-specific manner from DNA sequence alone, and identifies variants that impact gene expression and alternative polyadenylation. AlphaGenome (heavily inspired by Borzoi) predicts chromatin conformation, handles 1 Mb input sequences, and improves variant effect prediction for splicing and gene expression. Evo-2 (strictly speaking, a sequence-to-sequence model) predicts variant pathogenicity (including for splice-altering mutations), uses embeddings as inputs for fine-tuned output tasks (such as identifying BRCA1 deactivating mutations), and incorporates downstream models to generate sequences with specific chromatin features.
The field remains in its early days, and benchmark metrics (such as correlations and area under the precision–recall curve [AUPRC]), while improving, remain modest. Data constitutes a significant bottleneck for improving model performance and incorporating clinically relevant information. Tissue-level data remains valuable, but less so than cell-type (or cell-subtype) resolved molecular information. Nevertheless, these models reflect the cutting edge of what we can currently do; a post later in this series will describe the mechanics of fine-tuning open models on new data and potential applications beyond those described in the associated papers.
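For readers less familiar with these benchmark metrics, here is a minimal sketch of both in plain Python. It assumes average precision as the estimator of AUPRC (a common choice, though not necessarily the one used in each paper) and ignores tied scores; the labels and scores are made up for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def average_precision(labels, scores):
    """Mean of the precision values at the rank of each true positive (ties ignored)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank
    return total / hits if hits else float("nan")

# Made-up example: 1 = variant with a measured effect, 0 = variant without one.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
ap = average_precision(labels, scores)
r = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

Correlation suits continuous tracks such as predicted versus measured expression, whereas AUPRC suits classification tasks with few positives, where overall accuracy would be misleading.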
Continue Reading on our Substack
Our in-depth technical essays now live on Substack -- it's where we publish our full-length thinking.
This post continues there.
Stay tuned for our next post, focusing on foundation models of DNA methylation.