Fine-Tuning Borzoi with Single-Cell Epigenetics

Prior articles focused on current-generation foundation models in epigenetics: sequence-to-function models that learn tissue- and cell-specific epigenetic and expression states and their relationship to underlying DNA sequences, and function-to-function ("smoothing") methylation models that predict methylation states at one locus based on a context window of DNA sequence and DNA methylation states.

This new blog focuses on the practicality of fine-tuning foundation models using new data. As our single-cell CUT&Tag and Paired-Tag (scCUT&Tag + RNA) assays produce cellular maps of histone modifications, transcription factor binding, and chromatin remodeler occupancy (with or without corresponding RNA expression), fine-tuning Borzoi (a sequence-to-function model) on Paired-Tag maps is a natural use case.

We used a small dataset of human peripheral blood mononuclear cells (PBMCs; publicly available via our data request form) to fine-tune new cell-type-specific output heads for Borzoi, demonstrating the ability to predict functional consequences of mutations in hematological cell epigenetics. This practical article covers instance selection and setup, model loading and extension, data formatting, training, and prediction.

Background

Sequence-to-epigenetics foundation models learn relationships between DNA sequence and cell-specific or tissue-specific epigenetic states. In the current generation of models, cell specificity is not directly encoded in the latent sequence representation (or in the representation-determining model weights, the "encoder"), but rather in the output-specific components of the model (the "decoder"): specificity stems from the output-head weights, and every output head sees the same input representation. High predictive accuracy indicates that the model has learned a set of sequence-derived features "useful" for predicting epigenetic states, and the correlation of latent features with established genomic features and functional states is often taken as evidence that the model captures known biology.
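To make the encoder/decoder split concrete, here is a minimal sketch of the shared-representation, per-cell-type-head design described above. All layer sizes, head names, and the tiny convolutional encoder are illustrative placeholders, not Borzoi's actual architecture.

```python
import torch
import torch.nn as nn

class SeqToFunction(nn.Module):
    """Toy sequence-to-function model: one shared encoder, many output heads."""

    def __init__(self, n_features=64, heads=("CD4_T", "B_cell")):
        super().__init__()
        # "Encoder": maps one-hot DNA (4 channels) to a latent representation.
        self.encoder = nn.Sequential(
            nn.Conv1d(4, n_features, kernel_size=15, padding=7),
            nn.ReLU(),
        )
        # "Decoder": one output head per cell type; every head sees the
        # same latent features, so specificity lives in the head weights.
        self.heads = nn.ModuleDict(
            {name: nn.Conv1d(n_features, 1, kernel_size=1) for name in heads}
        )

    def forward(self, x):
        z = self.encoder(x)  # shared latent representation
        return {name: head(z) for name, head in self.heads.items()}

x = torch.randn(2, 4, 1024)  # batch of (one-hot-encoded) sequences
out = SeqToFunction()(x)
print(sorted(out))  # → ['B_cell', 'CD4_T']
```

Because every head reads the same latent features, adding a cell type means adding one small head, not retraining the encoder.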

[Figure: multi-panel image showing correlation of attention weights with functional sequences]
Foundation-model attention weights can be found that correlate with functional genomic features. (Top) scDNAm-GPT attention weights at certain layers coincide with H3K4me3 tracks within the same cell type. (Bottom) Attention weights from Evo-2 partition the genome into biologically relevant sections.

The design of such models makes extending them to new datasets straightforward: adding a new output head corresponds to a new cell-specific predictive model. If the pre-training successfully learned "good" features for sequence-to-function maps, only the new weights require updating, preserving the full model's performance on all other outputs.
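The freeze-and-extend recipe above can be sketched in a few lines. This is a hypothetical illustration, not our actual fine-tuning code: `model` stands in for a loaded Borzoi-style network with an `encoder` and a `heads` collection, and all names and sizes are assumptions.

```python
import torch
import torch.nn as nn

# Stand-in for a pretrained sequence-to-function model (illustrative sizes).
model = nn.ModuleDict({
    "encoder": nn.Conv1d(4, 64, kernel_size=15, padding=7),
    "heads": nn.ModuleDict({"CD4_T": nn.Conv1d(64, 1, kernel_size=1)}),
})

# Freeze all pretrained weights so existing outputs are untouched.
for p in model.parameters():
    p.requires_grad = False

# Add a fresh head for the new cell type; only its weights will train.
model["heads"]["new_celltype"] = nn.Conv1d(64, 1, kernel_size=1)
trainable = [p for p in model.parameters() if p.requires_grad]
opt = torch.optim.Adam(trainable, lr=1e-3)

# One illustrative gradient step against a dummy target track.
x, y = torch.randn(2, 4, 1024), torch.randn(2, 1, 1024)
pred = model["heads"]["new_celltype"](model["encoder"](x))
loss = nn.functional.mse_loss(pred, y)
loss.backward()
opt.step()
```

Since the optimizer only sees the new head's parameters, the encoder and all pre-existing heads are provably unchanged after fine-tuning.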

Continue Reading on our Substack

Our in-depth technical essays now live on Substack, where we publish our full-length thinking.

This post continues there.

Read the complete version on Substack

You can subscribe (free) to get future long-form pieces directly in your inbox.

Stay tuned for our next post, which discusses some attention alternatives for scaling up genomic foundation models.