Steering single features

Published

April 21, 2026

If an SAE feature really encodes a biological program, it should be manipulable: amplifying it should push the associated genes up, and suppressing it should push them down — symmetrically. That’s the first steering test we ran.

Procedure

Pick a feature and scale its natural activation by a factor \(\alpha\): \(\alpha=1\) is a no-op, \(\alpha=5\) amplifies the feature fivefold, and \(\alpha=-5\) inverts it at the same magnitude. Add the SAE’s reconstruction error \(e(x)\) back so that signal the SAE didn’t capture still flows forward, then let the forward pass continue from layer 12 onward. Writing \(\hat{z}\) for the SAE’s feature activations on \(x\), formally,

\[ x_\mathrm{steered} = (\boldsymbol{\alpha} \odot \hat{z})\,W_\mathrm{dec} + b_\mathrm{pre} + e(x), \]

with \(\alpha_i = \alpha\) for the chosen feature and \(\alpha_i = 1\) for every other feature, so unsteered features pass through unchanged. Differential expression of the steered output versus the baseline then tells us what the feature moved.
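A minimal NumPy sketch of the steering step, under stated assumptions: the ReLU encoder, the parameter names `W_enc` and `b_enc`, and the random toy weights are all hypothetical, since the post only specifies the decoder side of the equation. It takes \(e(x)\) to be the input minus the SAE reconstruction. Under that reading the steered vector reduces to \(x + (\alpha - 1)\,\hat{z}_i\,W_{\mathrm{dec},i}\), which is why \(\alpha = 1\) is an exact no-op.

```python
import numpy as np

def steer(x, W_enc, b_enc, W_dec, b_pre, feature, alpha):
    """Rescale one SAE feature by alpha; return the steered activation and z_hat."""
    # Assumed encoder: a plain ReLU SAE. The post does not state the architecture.
    z_hat = np.maximum((x - b_pre) @ W_enc + b_enc, 0.0)
    x_hat = z_hat @ W_dec + b_pre            # SAE reconstruction of x
    err = x - x_hat                          # reconstruction error e(x)
    scale = np.ones(z_hat.shape[-1])         # alpha_i = 1 for unsteered features
    scale[feature] = alpha                   # alpha_i = alpha for the chosen one
    x_steered = (scale * z_hat) @ W_dec + b_pre + err
    return x_steered, z_hat

# Toy dimensions and random weights, for illustration only.
rng = np.random.default_rng(0)
d_model, n_features = 8, 32
W_enc = rng.normal(size=(d_model, n_features))
b_enc = np.zeros(n_features)
W_dec = rng.normal(size=(n_features, d_model))
b_pre = rng.normal(size=d_model)
x = rng.normal(size=(4, d_model))            # 4 token positions

x_same, _ = steer(x, W_enc, b_enc, W_dec, b_pre, feature=3, alpha=1.0)
x_amp, z_hat = steer(x, W_enc, b_enc, W_dec, b_pre, feature=3, alpha=5.0)

# alpha = 1 returns the original activation.
noop_ok = np.allclose(x_same, x)
# alpha = 5 shifts x along the feature's decoder row by (alpha - 1) * z_hat_i.
shift_ok = np.allclose(x_amp - x, 4.0 * z_hat[:, [3]] * W_dec[3])
print(noop_ok, shift_ok)
```

Because the error term is added back, the only change to the residual stream is along the chosen feature's decoder direction. Tokens where the feature is inactive (\(\hat{z}_i = 0\)) are left untouched for any \(\alpha\).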

Two examples

We picked two features from different connected components of the feature graph, and within each component picked the feature with the fewest annotated GO terms — the most concentrated signal, least likely to mix programs:

F4367: viral defence (top genes enriched for type-I interferon, RSAD2, IFIT family).
F3170: B-cell identity (top genes enriched for BCR signalling, B-cell activation).
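The selection rule above (one feature per connected component, choosing the one with the fewest GO terms) could be sketched as follows. The feature IDs, edges, and GO annotations here are toy placeholders, not the real feature graph. The tie-break on feature ID is an assumption, as is the implicit choice to treat every feature as having at least one annotation.

```python
from collections import defaultdict

def connected_components(nodes, edges):
    """Return the connected components of an undirected graph as lists of nodes."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], []
        while stack:                      # iterative depth-first search
            node = stack.pop()
            comp.append(node)
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        components.append(comp)
    return components

def most_specific_feature(component, go_terms):
    """Pick the feature with the fewest GO terms (ties broken by ID, an assumption)."""
    return min(component, key=lambda f: (len(go_terms[f]), f))

# Toy graph: two components, {f0, f1, f2} and {f3, f4}.
nodes = ["f0", "f1", "f2", "f3", "f4"]
edges = [("f0", "f1"), ("f1", "f2"), ("f3", "f4")]
go_terms = {
    "f0": ["term_a", "term_b", "term_c"],
    "f1": ["term_a"],
    "f2": ["term_a", "term_b"],
    "f3": ["term_d", "term_e", "term_f", "term_g"],
    "f4": ["term_d", "term_e"],
}

picks = [most_specific_feature(c, go_terms) for c in connected_components(nodes, edges)]
print(picks)
```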

For each feature we swept \(\alpha \in [-5, 5]\) and ran GO enrichment on the top 100 Bonferroni-significant differentially expressed (DE) genes in each direction.
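The gene-selection step could look roughly like the sketch below. The post does not say which DE test was used or how the "top 100" were ranked, so this assumes raw per-gene p-values, a 0.05 family-wise level, and ranking by signed log fold change. The function name, parameter names, and toy inputs are all hypothetical.

```python
import numpy as np

def top_de_genes(genes, log_fc, p_values, alpha_level=0.05, top_n=100):
    """Return (up, down) gene lists that pass a Bonferroni cut, strongest first."""
    genes = np.asarray(genes)
    log_fc = np.asarray(log_fc, dtype=float)
    p_values = np.asarray(p_values, dtype=float)

    # Bonferroni: compare each raw p-value with alpha_level / number of tests.
    significant = p_values < alpha_level / len(p_values)

    up_idx = np.flatnonzero(significant & (log_fc > 0))
    down_idx = np.flatnonzero(significant & (log_fc < 0))

    # Assumed ranking: largest effect first within each direction.
    up_idx = up_idx[np.argsort(-log_fc[up_idx])][:top_n]
    down_idx = down_idx[np.argsort(log_fc[down_idx])][:top_n]
    return genes[up_idx].tolist(), genes[down_idx].tolist()

# Toy inputs with placeholder gene names (not real results).
genes = ["g1", "g2", "g3", "g4", "g5"]
log_fc = [2.0, 0.5, -1.5, -3.0, 1.0]
p_values = [1e-6, 1e-4, 1e-5, 1e-8, 0.2]

up, down = top_de_genes(genes, log_fc, p_values)
print(up, down)
```

Each of the two lists would then go to a GO enrichment tool separately, giving one column of enriched terms per direction.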

Viral defence (repressor behaviour)

\(\alpha = 5\) (down-regulated genes)	\(\alpha = -5\) (up-regulated genes)
Defense response to virus	Defense response to virus
Type I interferon signalling	Type I interferon signalling
Neg. reg. of viral genome replication	Neg. reg. of viral process
Cytokine-mediated signalling	Antiviral innate immune response

F4367 acts as a repressor: positive \(\alpha\) suppresses the antiviral program and negative \(\alpha\) activates it. The GO terms enriched in the two directions mirror each other.

B-cell identity (activator behaviour)

\(\alpha = 2\) (up-regulated genes)	\(\alpha = -2\) (down-regulated genes)
B cell receptor signalling	B cell receptor signalling
B cell activation	B cell activation
Regulation of B cell proliferation	Regulation of B cell proliferation
Antigen receptor-mediated signalling	Regulation of B cell activation

F3170 acts as an activator: positive \(\alpha\) turns on the BCR program and negative \(\alpha\) turns it off. Again, the GO signatures are symmetric.

Why this matters

Features aren’t just correlative descriptors — they’re operable axes. Moving \(\alpha\) on a single feature reproduces the corresponding biological program in the model’s output. The symmetry of \(+\alpha\) vs \(-\alpha\) says the model isn’t just “aware” of these programs as static labels; it has a continuous representation that can be dialled in either direction.

This also licenses the next step: if single features work, we can optimise combinations of them to steer more complex transitions, moving beyond “more/less antiviral” to “CD4 → CD8”.
