What Can Epigenomic Foundation Models Reveal About Biology?
Part 2: Modeling DNA Methylation
Our previous post focused on epigenetic foundation models that used DNA sequence (and DNA sequence only!) as input, learning effective nucleotide representations to predict molecular outputs such as RNA expression, open chromatin, and splicing. This post turns to models that augment the DNA sequence with DNA methylation, one of the many epigenetic layers, to improve overall modeling of cell type and state. Importantly, the models described here differ from their predecessors in that their encoded representations incorporate the contextual epigenetic cues critical for predicting methylation status at unseen CpG sites and across cell types.
Function-to-Function Models
A quirk of history bifurcated the world of epigenetics into two camps, DNA methylation and chromatin, and the landscape of foundation models is no different. Indeed, the sequence-to-function models considered in our last post included AlphaGenome and GENA-LM, which operate at base-pair (or token) resolution. These models could therefore have incorporated methylation state as an output, but neither included a methylation output head, likely a matter of scope rather than feasibility. This post focuses on a set of foundation models that aim to recapitulate or predict methylation: scMeFormer, scDNAm-GPT, MethylGPT, and CpGPT.
Unlike sequence-to-function models, function-to-function models often take multiple modalities as input whose dimensions may not align; the network architecture must therefore be designed carefully to leverage both. The models in this overview take different approaches: learning additive representations (enc(DNA) + enc(DNAm)), disjoint representations (enc(DNA); enc(DNAm)), or combining modalities up front (enc(DNA, DNAm)). A head-to-head comparison of these approaches is not yet possible, as architectures, datasets, and benchmarks differ across models; for now, we can only summarize how each model performs on its own.
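The three fusion strategies can be sketched in a few lines of plain Python, with lists standing in for learned embeddings. All function names and dimensions here are illustrative assumptions for exposition, not any model's actual API.

```python
# Toy sketch of the three fusion strategies: additive, disjoint
# (concatenated), and joint (combined up front). Encoders are trivial
# lookup tables standing in for learned networks.

def enc_dna(seq):
    """Hypothetical DNA encoder: map each base to a 2-dim embedding."""
    table = {"A": [1.0, 0.0], "C": [0.0, 1.0], "G": [-1.0, 0.0], "T": [0.0, -1.0]}
    return [table[b] for b in seq]

def enc_dnam(betas):
    """Hypothetical methylation encoder: lift each beta value to 2 dims."""
    return [[b, 1.0 - b] for b in betas]

def fuse_additive(seq, betas):
    """enc(DNA) + enc(DNAm): element-wise sum (dimensions must match)."""
    return [[a + b for a, b in zip(x, y)]
            for x, y in zip(enc_dna(seq), enc_dnam(betas))]

def fuse_disjoint(seq, betas):
    """(enc(DNA); enc(DNAm)): concatenate per-position embeddings."""
    return [x + y for x, y in zip(enc_dna(seq), enc_dnam(betas))]

def fuse_joint(seq, betas):
    """enc(DNA, DNAm): encode a combined (base, beta) token up front."""
    table = {"A": 0.0, "C": 1.0, "G": 2.0, "T": 3.0}
    return [[table[b], m] for b, m in zip(seq, betas)]

seq, betas = "ACGT", [0.0, 0.9, 0.5, 1.0]
print(fuse_additive(seq, betas))  # 4 positions x 2 dims
print(fuse_disjoint(seq, betas))  # 4 positions x 4 dims
```

Note the practical constraint the additive form imposes: the two encoders must emit embeddings of the same width, whereas the disjoint and joint forms are free to size each modality independently.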
A prominent feature of these models, relative to their predecessors, is their single-modality output design. Whereas Borzoi and AlphaGenome attempt to predict a broad spectrum of tissue- and cell-specific epigenetic states from DNA sequence, the models described in this post focus only on DNA methylation state. From an engineering standpoint, there is little practical difference between fine-tuning these models or AlphaGenome on a new modality (i.e., given sequence + methylation, predict expression); in both cases, a new output network must be added and fine-tuned. However, one would expect the methylation models to be somewhat "over-adapted" to methylation prediction and to perform better when the whole network is fine-tuned rather than freezing the weights and training only the new output head.
The fundamental difference between methylation-aware data and sequence-only data is the potential for partial methylation: across a population of cells, a particular locus may be fully methylated, fully unmethylated, or anywhere in between, with the average methylation level called the "beta value." DNA sequences do not share this spectrum, except in the context of somatic mosaicism (usually in cancer samples). At the single-cell level, however, methylation state effectively becomes equivalent to a nucleotide: in humans, only 0, 1, or 2 copies can be methylated, just as only 0, 1, or 2 copies of the genome may harbor a polymorphic allele. It is therefore mildly surprising that none of these models, especially the single-cell ones, investigated expanding the alphabet to include a methylation character (attempted and dismissed?). These models all learn representations of epigenetic state and sequence context, and all use a stack of attention-like layers; beyond that, the architectural approaches are as diverse as the benchmarks they use.
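Both ideas above, the beta value and the "extra letter" for single cells, fit in a short toy example. The 'M'-for-methylated-C encoding is our own hypothetical illustration of alphabet expansion, not something any of these models implements.

```python
# Toy illustration of beta values and single-cell alphabet expansion.
# Methylation observations are lists of 0/1 calls per CpG.

def beta_value(calls):
    """Fraction of methylated observations at one CpG across cells."""
    return sum(calls) / len(calls)

# A locus observed in 8 cells: 6 methylated, 2 not -> beta = 0.75.
print(beta_value([1, 1, 1, 0, 1, 1, 0, 1]))  # 0.75

def expand_alphabet(seq, meth_calls):
    """Single-cell view: rewrite each methylated 'C' as 'M', giving a
    5-letter alphabet in which methylation behaves like a nucleotide."""
    out, i = [], 0
    for base in seq:
        if base == "C":
            out.append("M" if meth_calls[i] else "C")
            i += 1
        else:
            out.append(base)
    return "".join(out)

print(expand_alphabet("ACGTCG", [1, 0]))  # "AMGTCG"
```

At the population level only the continuous beta value survives; the discrete 5-letter encoding is meaningful only for single-cell (or single-allele) data, which is why the question arises specifically for the single-cell models.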
Overview
All methylation state models start with some variant of masked methylation prediction for pre-training, followed by a round of fine-tuning for a specific task. The approaches resemble those of expression models such as scGPT, GET, scBERT, and CellFM, but lean heavily on the genomic ordering of CpGs. As with expression models, the fine-tuning tasks range from donor-level predictions (species, tissue, age) to cell-level predictions (cell type). One could ask questions more directly related to functional engineering, such as the impact of mutations at CpGs, or the predicted impact of targeted demethylation (dCas9-TET1) at nearby CpGs (topics that may appear in follow-up posts).
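The masked-methylation objective can be sketched in a few lines: hide a random subset of CpG beta values, predict them, and score only the hidden positions. The mean-imputation "predictor" below is a deliberately trivial stand-in for a transformer; everything here is illustrative, not any model's actual training code.

```python
# Minimal sketch of masked methylation pre-training: mask beta values,
# impute them, and compute loss only at the masked positions.
import random

def mask_betas(betas, mask_frac=0.15, seed=0):
    """Hide a random fraction of sites (None marks a masked beta)."""
    rng = random.Random(seed)
    n_mask = max(1, int(len(betas) * mask_frac))
    masked_idx = set(rng.sample(range(len(betas)), n_mask))
    visible = [None if i in masked_idx else b for i, b in enumerate(betas)]
    return visible, masked_idx

def mean_impute(visible):
    """Baseline predictor: fill masked sites with the mean of visible ones."""
    seen = [b for b in visible if b is not None]
    mean = sum(seen) / len(seen)
    return [mean if b is None else b for b in visible]

def masked_mse(pred, truth, masked_idx):
    """Loss is computed only at masked positions, BERT-style."""
    errs = [(pred[i] - truth[i]) ** 2 for i in masked_idx]
    return sum(errs) / len(errs)

betas = [0.0, 0.1, 0.9, 1.0, 0.5, 0.8, 0.2, 0.95]
visible, masked_idx = mask_betas(betas, mask_frac=0.25)
pred = mean_impute(visible)
print(round(masked_mse(pred, betas, masked_idx), 3))
```

A real model would replace `mean_impute` with a network that also sees the DNA sequence and the ordered neighboring CpGs, which is exactly where the CpG genomic ordering mentioned above earns its keep.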
Each foundation model highlights a different strength. CpGPT highlights robust performance on imputation and methylation-age prediction, and the extension of chain-of-thought reasoning to imputation tasks. MethylGPT focuses on the effectiveness of methylation for predicting disease and treatment response. Among the single-cell models, scDNAm-GPT highlights the Mamba architecture's training speed, pseudotime inference, and clustering performance, whereas scMeFormer highlights non-degradation of clustering performance under downsampling, heritability enrichment, and imputation-driven DMR inference.
Continue Reading on Substack
Our in-depth and technical essays now live on Substack -- it's where we publish our full-length thinking. This post continues there; read the complete version on Substack.
You can subscribe (free) to get future long-form pieces directly in your inbox.