Project overview

Published: April 21, 2026

What we’re doing

Single-cell foundation models (scFMs) like AIDO.Cell learn rich cellular representations through masked reconstruction of gene expression, but their internal computations are opaque. We’re applying Sparse Autoencoders (SAEs), the same tool that has been used to decompose LLM activations into monosemantic features, to the residual stream of AIDO.Cell. The goal is twofold:

Dissect the latent space into interpretable, biologically meaningful features (immune programs, cytokine responses, cell-type identity modules, …).
Steer model outputs using those features, as a route toward interpretable perturbation prediction.

Why SAEs

Raw scFM neurons are polysemantic — each one mixes many biological concepts (an expected consequence of superposition). SAEs project the activation onto an overcomplete, sparse basis. On AIDO.Cell layer 12 we train a Top-k SAE (k=32, 5120 features, 8× overcomplete) that reconstructs the residual stream while keeping activations sparse:

\[ \hat{z} = \mathrm{TopK}(W_\mathrm{enc}(x - b_\mathrm{pre})), \quad \hat{x} = W_\mathrm{dec}\hat{z} + b_\mathrm{pre}. \]

During steering, we scale a feature’s activation by a factor \(\alpha\) and re-inject the modified reconstruction into the residual stream, adding back the SAE’s error term \(e(x) = x - \hat{x}\) so that signal the SAE fails to capture still flows forward. The rest of the forward pass runs unchanged, so any downstream difference is attributable to the amplified or suppressed feature.
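To make the steering step concrete, here is a minimal pure-Python sketch of the Top-k encoder/decoder from the equation above, followed by single-feature steering with the error term added back. The sizes, the random weights and the helper names (`matvec`, `topk`, `encode`, `decode`, `steer`) are illustrative assumptions, not the project's code; a real run would use trained SAE weights and a tensor library. One thing the sketch makes visible: because the error term is added back, scaling feature \(j\) by \(\alpha\) is the same as adding \((\alpha - 1)\,\hat{z}_j\) times the \(j\)-th decoder column to \(x\).

```python
import random

# Toy sizes (assumption); the post uses k=32 and 5120 features.
D_MODEL, N_FEATURES, K = 4, 8, 2

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def topk(pre, k):
    """Keep the k largest pre-activations and zero the rest."""
    keep = set(sorted(range(len(pre)), key=lambda i: pre[i], reverse=True)[:k])
    return [p if i in keep else 0.0 for i, p in enumerate(pre)]

def encode(x, W_enc, b_pre, k):
    """z_hat = TopK(W_enc (x - b_pre))."""
    centred = [xi - bi for xi, bi in zip(x, b_pre)]
    return topk(matvec(W_enc, centred), k)

def decode(z, W_dec, b_pre):
    """x_hat = W_dec z_hat + b_pre."""
    return [r + b for r, b in zip(matvec(W_dec, z), b_pre)]

def steer(x, W_enc, W_dec, b_pre, k, feature, alpha):
    """Scale one feature by alpha and add the SAE error term back."""
    z = encode(x, W_enc, b_pre, k)
    error = [xi - hi for xi, hi in zip(x, decode(z, W_dec, b_pre))]
    z_steered = list(z)
    z_steered[feature] *= alpha
    return [h + e for h, e in zip(decode(z_steered, W_dec, b_pre), error)]

# Random placeholder weights and input (assumption: stand-ins for a trained SAE).
random.seed(0)
W_enc = [[random.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(N_FEATURES)]
W_dec = [[random.gauss(0, 1) for _ in range(N_FEATURES)] for _ in range(D_MODEL)]
b_pre = [random.gauss(0, 1) for _ in range(D_MODEL)]
x = [random.gauss(0, 1) for _ in range(D_MODEL)]

z = encode(x, W_enc, b_pre, K)
j = max(range(N_FEATURES), key=lambda i: abs(z[i]))  # pick an active feature
x_up = steer(x, W_enc, W_dec, b_pre, K, feature=j, alpha=2.0)
```

With `alpha=1.0` the function returns `x` itself (up to floating-point rounding), which is the point of keeping the error term: steering changes only what the chosen feature contributes.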

Why AIDO.Cell (and not scGPT / Geneformer / scFoundation)

Feature interpretation is a combinatorial problem: which sets of genes does this feature activate on? That breaks down if the scFM only sees a subset of genes per cell, because the feature dictionary then depends on an arbitrary gene-selection choice, and the discoverable programs are a strict subset of the full-transcriptome ones. AIDO.Cell models all ~19k human genes per cell, across all layers, which is the property we need. scGPT and Geneformer subset genes; scFoundation skips zero-expressed genes; scBERT’s embedding quality was sub-par in our tests.

Where this sits in the literature

Two recent works applied SAEs to scFMs (Pedrocchi et al. 2025, Schuster 2025). Both were constrained by the backbone scFM’s architecture or focused on single-feature enrichment without strong causal validation. The additions here are (i) full-transcriptome modelling, (ii) a data-driven adaptive gene selection (PR) instead of arbitrary top-N, (iii) direct causal validation via single-feature steering, and (iv) contrastive multi-feature steering for cell-type reprogramming.

Why this matters

If SAE features correspond to coherent biological programs, they give us:

Mechanistic readouts. Instead of “the model moved the cell” we can say which programs it recruited — cytokine signalling, MHC-I presentation, effector differentiation, and so on.
Controllable perturbations. A feature-level \(\alpha\) vector is a human-readable “recipe” for a phenotypic transition.
Negative results too. When a feature cannot move a gene, or when no single feature encodes a regulatory relationship, we learn what the scFM is (and isn’t) actually representing.

Where the project is

The project is in a proof-of-concept (POC) phase on PBMC3K. The main threads, all covered in the posts that follow:

Feature space. SAE features get significant GO annotations ~2× more often than raw neurons, and are more semantically specific.
Single-feature steering. Viral defence and B-cell identity features produce symmetric, coherent DE responses on \(\pm\alpha\) perturbations.
Contrastive steering. Gradient descent on a sparse \(\alpha\) vector reprograms CD4 T cells toward the CD8 centroid, recruiting biologically sensible features (IL-12/IL-18, MHC class I, effector cytotoxicity).
Objective matters. Latent-space vs expression-space objectives find almost disjoint solutions — a hint that the scFM’s internal geometry is not the same as its output-expression geometry.
Regulatory logic probe. On TBX21 → targets, perturbations do not propagate preferentially in the TF→target direction. The scFM seems to encode co-expression rather than regulatory causality.
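
As a rough illustration of the contrastive steering thread above, the sketch below fits a sparse \(\alpha\) vector by proximal gradient descent on a toy problem. Several things here are assumptions rather than the project's method: the target is matched directly at the SAE layer (the real objective is evaluated after the downstream forward pass), sparsity is enforced with an L1 penalty on \(\alpha - 1\), and the decoder, code and target vectors are made-up numbers. The steered activation uses the identity \(x' = x + \sum_j (\alpha_j - 1)\,\hat{z}_j d_j\), where \(d_j\) is the \(j\)-th decoder column.

```python
def soft_threshold(v, t):
    """Proximal operator of t * |v|: shrink v toward zero by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def steered(x, z, W_dec, alpha):
    """x' = x + sum_j (alpha_j - 1) * z_j * d_j (error term preserved)."""
    return [
        xi + sum((a - 1.0) * zj * w for a, zj, w in zip(alpha, z, row))
        for xi, row in zip(x, W_dec)
    ]

def fit_alpha(x, z, W_dec, target, lam=0.1, lr=0.1, steps=200):
    """Minimise ||x'(alpha) - target||^2 + lam * ||alpha - 1||_1."""
    alpha = [1.0] * len(z)  # alpha = 1 everywhere means "no steering"
    for _ in range(steps):
        resid = [s - t for s, t in zip(steered(x, z, W_dec, alpha), target)]
        for j in range(len(z)):
            col = [row[j] for row in W_dec]
            grad = 2.0 * z[j] * sum(c * r for c, r in zip(col, resid))
            # Gradient step on the squared error, then shrink toward alpha_j = 1.
            alpha[j] = 1.0 + soft_threshold(alpha[j] - 1.0 - lr * grad, lr * lam)
    return alpha

# Hypothetical toy values: 3-dimensional activations, 4 SAE features.
W_dec = [
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
]
z = [1.0, 2.0, 0.0, 0.0]   # sparse code of the source cell (two active features)
x = [1.0, 2.0, 0.5]        # source activation
target = [3.0, 2.0, 0.5]   # stand-in for a target centroid

alpha = fit_alpha(x, z, W_dec, target)
# alpha[0] converges to about 2.95; the other entries stay at 1.0.
```

Only the first feature needs to move to reach the target, and the L1 term leaves every other \(\alpha_j\) at exactly 1, which is what makes the resulting vector readable as a short list of recruited features. Inactive features (\(\hat{z}_j = 0\)) receive zero gradient in this formulation, so scaling alone cannot switch them on.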

Known limitations (upfront)

The POC is on PBMC3K, which contains only circulating immune cells, so tissue-resident and developmental programs are absent. Interpretation leans on Gene Ontology, which is biased toward well-characterised pathways. There’s no cross-model validation: AIDO.Cell was the only accessible scFM satisfying the full-transcriptome-per-cell requirement. Steering closes the gap to the target mostly on low-effect-size genes, and canonical markers like CD8A/CD8B are barely moved by the latent-centroid contrastive objective. The expression-space follow-up (post 5) engages with this directly.