Steering single features
If an SAE feature really encodes a biological program, it should be manipulable: amplifying it should push the associated genes up, and suppressing it should push them down — symmetrically. That’s the first steering test we ran.
Procedure
Pick a feature, scale its natural activation by a factor \(\alpha\) (so \(\alpha=1\) is a no-op, \(\alpha=5\) amplifies, \(\alpha=-5\) inverts), add the SAE’s reconstruction error \(e(x) = x - \hat{x}\) back so signal the SAE didn’t capture still flows forward, and let the forward pass continue from layer 12 onward. Formally,
\[ x_\mathrm{steered} = (\boldsymbol{\alpha} \odot \hat{z})\,W_\mathrm{dec} + b_\mathrm{pre} + e(x), \]
with \(\alpha_i = \alpha\) for the chosen feature and 1 elsewhere. Unsteered features pass through unchanged. Differential expression of the steered output vs the baseline then tells us what the feature moved.
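The steering step can be sketched in a few lines of NumPy. This is a minimal illustration rather than the actual pipeline: the ReLU encoder, the names `W_enc` and `b_enc`, and the function `steer_feature` are assumptions introduced here; only \(W_\mathrm{dec}\), \(b_\mathrm{pre}\), \(\hat{z}\), \(e(x)\) and \(\alpha\) come from the formula above.

```python
import numpy as np

def steer_feature(x, W_enc, b_enc, W_dec, b_pre, feature_idx, alpha):
    """Scale one SAE feature by alpha and return the steered activation.

    x: (n_cells, d_model) activations at the steered layer.
    W_enc: (d_model, n_features), W_dec: (n_features, d_model).
    """
    # Assumed encoder: a plain ReLU SAE. The source does not specify it.
    z_hat = np.maximum((x - b_pre) @ W_enc + b_enc, 0.0)
    # Reconstruction error e(x) = x - x_hat, added back after steering.
    x_hat = z_hat @ W_dec + b_pre
    err = x - x_hat
    # Scaling vector: alpha for the chosen feature, 1 everywhere else.
    scale = np.ones(z_hat.shape[-1])
    scale[feature_idx] = alpha
    return (scale * z_hat) @ W_dec + b_pre + err

# Toy usage with random weights (hypothetical sizes, not the real model).
rng = np.random.default_rng(0)
d_model, n_features = 8, 32
x = rng.normal(size=(4, d_model))
W_enc = rng.normal(size=(d_model, n_features))
W_dec = rng.normal(size=(n_features, d_model))
b_enc = np.zeros(n_features)
b_pre = np.zeros(d_model)
x_steered = steer_feature(x, W_enc, b_enc, W_dec, b_pre, feature_idx=3, alpha=5.0)
```

Because the error term is added back, the steered activation differs from the original by exactly \((\alpha - 1)\,\hat{z}_i\) times the chosen feature's decoder row, so \(\alpha = 1\) returns \(x\) unchanged.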
Two examples
We picked two features from different connected components of the feature graph, and within each component picked the feature with the fewest annotated GO terms — the most concentrated signal, least likely to mix programs:
- F4367: viral defence (top genes enriched for type-I interferon, RSAD2, IFIT family).
- F3170: B-cell identity (top genes enriched for BCR signalling, B-cell activation).
For each we swept \(\alpha \in [-5, 5]\) and GO-enriched the top 100 Bonferroni-significant DE genes in each direction.
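The gene-selection step can be sketched as follows, assuming per-gene log fold changes and raw p-values have already been computed by a DE test of the steered output against the baseline. The function name `top_de_genes`, the 0.05 threshold, and ranking by log fold change are assumptions for illustration; the text above does not say how "top" was defined.

```python
import numpy as np

def top_de_genes(gene_names, log_fc, p_values, alpha_level=0.05, top_n=100):
    """Return (up, down) lists of Bonferroni-significant genes.

    Genes are ranked by log fold change magnitude within each direction
    (an assumed ranking; the source does not specify one).
    """
    gene_names = np.asarray(gene_names)
    log_fc = np.asarray(log_fc, dtype=float)
    p_values = np.asarray(p_values, dtype=float)
    # Bonferroni correction: multiply by the number of tests, cap at 1.
    p_adj = np.minimum(p_values * p_values.size, 1.0)
    significant = p_adj < alpha_level
    up_idx = np.flatnonzero(significant & (log_fc > 0))
    down_idx = np.flatnonzero(significant & (log_fc < 0))
    # Strongest up-regulation first, strongest down-regulation first.
    up_idx = up_idx[np.argsort(-log_fc[up_idx])][:top_n]
    down_idx = down_idx[np.argsort(log_fc[down_idx])][:top_n]
    return gene_names[up_idx].tolist(), gene_names[down_idx].tolist()
```

The two returned lists are what would then be passed to a GO enrichment tool, one list per direction.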
B-cell identity (activator behaviour)
| \(\alpha = 2\) (up-regulated genes) | \(\alpha = -2\) (down-regulated genes) |
|---|---|
| B cell receptor signalling | B cell receptor signalling |
| B cell activation | B cell activation |
| Regulation of B cell proliferation | Regulation of B cell proliferation |
| Antigen receptor-mediated signalling | Regulation of B cell activation |
The feature acts as an activator: positive \(\alpha\) turns on the BCR program and negative \(\alpha\) turns it off. The GO signatures are close to symmetric: three of the four top terms are shared between the two directions.
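One simple way to put a number on this symmetry is the Jaccard overlap between the GO terms enriched in each direction. This is an illustrative metric chosen here, not necessarily the one used in the analysis, and `go_term_overlap` is a hypothetical helper.

```python
def go_term_overlap(up_terms, down_terms):
    """Jaccard overlap between two collections of GO term names."""
    up, down = set(up_terms), set(down_terms)
    union = up | down
    # Two empty sets are treated as having no overlap.
    return len(up & down) / len(union) if union else 0.0

# Terms copied from the table above.
up_terms = [
    "B cell receptor signalling",
    "B cell activation",
    "Regulation of B cell proliferation",
    "Antigen receptor-mediated signalling",
]
down_terms = [
    "B cell receptor signalling",
    "B cell activation",
    "Regulation of B cell proliferation",
    "Regulation of B cell activation",
]
print(go_term_overlap(up_terms, down_terms))  # prints 0.6
```

Three shared terms out of five distinct terms give 0.6; a value of 1.0 would mean the two directions recover identical term sets.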
Why this matters
Features aren’t just correlative descriptors — they’re operable axes. Moving \(\alpha\) on a single feature reproduces the corresponding biological program in the model’s output. The symmetry of \(+\alpha\) vs \(-\alpha\) says the model isn’t just “aware” of these programs as static labels; it has a continuous representation that can be dialled in either direction.
This also licenses the next step: if single features work, we can optimise combinations of them to steer more complex transitions, going beyond “more/less antiviral” to “CD4 → CD8”.
Further reading
reports/steering_results.md: full DE tables, all three initial experiments (viral, B-cell, neutrophil), including the polysemantic edge cases.