#interpretability #mechanistic #sparse-autoencoders #transformers

Emergent Computational Gating in Dense Transformers

Dense transformers develop bimodal processing gates at layers 3-4 that nobody designed. We confirmed this in three models from two families (Mistral and Qwen). Standard sparse autoencoders (SAEs) fall as low as -3,059% explained variance on deep layers; a SipIt + SAE + GLP pipeline recovers those layers.

We didn't go looking for a routing mechanism in dense transformer models — there isn't one in the architecture. But every token entering Mistral-7B, Mistral-Small-24B, and Qwen3-4B gets routed into one of two processing paths at layers 3-4, and the routing is consistent enough that it's clearly a real circuit, not a statistical artifact.

The core finding

Every token entering the model gets routed into one of two paths:

  • Mode A (93–95% of tokens): shallow processing, minimal transformation
  • Mode B (5–7%): deep processing, massive representational shift

There's no mixture-of-experts routing. There's no explicit gating layer. The model develops this routing during training, on its own.
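As a rough illustration of what "routed into one of two paths" could look like operationally, the sketch below scores each token by how far its hidden state moves between two adjacent layers and labels the large movers as Mode B. The scoring function, the names `layer_shift` and `split_modes`, the median-multiple threshold, and the synthetic activations are all assumptions made for illustration; the post does not specify how the modes were assigned.

```python
import numpy as np

def layer_shift(h_prev, h_next):
    """Per-token L2 distance between hidden states of two adjacent layers."""
    return np.linalg.norm(h_next - h_prev, axis=-1)

def split_modes(shift, ratio=5.0):
    """Label tokens whose shift exceeds `ratio` times the median as Mode B.

    Hypothetical rule: the threshold is an assumption, not the post's method.
    """
    threshold = ratio * np.median(shift)
    return shift > threshold

# Synthetic stand-in for real activations: most tokens barely move,
# a small subset moves a lot.
rng = np.random.default_rng(0)
n_tokens, d_model = 1000, 64
h3 = rng.normal(size=(n_tokens, d_model))
h4 = h3 + 0.1 * rng.normal(size=(n_tokens, d_model))
deep = rng.choice(n_tokens, size=60, replace=False)  # 6% synthetic "deep" tokens
h4[deep] += 10.0 * rng.normal(size=(60, d_model))

mode_b = split_modes(layer_shift(h3, h4))
print(f"Mode B share: {mode_b.mean():.1%}")
```

On real activations the two populations would not be this cleanly separated, so a fixed median multiple is only a starting point; a histogram of the shift values is the more honest first check for bimodality.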

Cross-architecture confirmation

| Model | Family | Parameters | Gate layer | Mode B share |
| --- | --- | --- | --- | --- |
| Mistral-7B | Mistral | 7B | L3-4 | ~50% of outlier tokens |
| Mistral-Small-24B | Mistral | 24B | L3-4 | 47.6% of responses |
| Qwen3-4B | Qwen | 4B | L3-4 | 7.27% of tokens |

Statistical evidence

| Model | Evidence |
| --- | --- |
| Mistral-7B | L2-L3 avg distance +2,170%, std dev +17,750%, avg/median ratio 17.9× |
| Mistral-Small-24B | Cohen's d = 4.8, Silhouette = 0.83, AUC-ROC = 0.97, p < 0.001 |
| Qwen3-4B | Feature 374 correlation 0.96 across L3–L5, bimodal ratio 1.09 |
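For readers unfamiliar with the effect-size number quoted for Mistral-Small-24B, here is a minimal sketch of Cohen's d using the pooled-standard-deviation form. The function name and the toy scores are illustrative assumptions; the post does not say which variant of d was used.

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d: difference in means divided by the pooled sample std dev."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (b.mean() - a.mean()) / np.sqrt(pooled_var)

# Toy per-token scores (not data from the post).
mode_a_scores = [1.0, 2.0, 3.0]
mode_b_scores = [5.0, 6.0, 7.0]
print(cohens_d(mode_a_scores, mode_b_scores))
```

A d of 4.8 means the two group means sit almost five pooled standard deviations apart, which is why the two modes barely overlap.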

Causal proof

Ablating the gate at Layer 3 in Qwen3-4B:

| Metric | Baseline | After ablation | Change |
| --- | --- | --- | --- |
| L6 Mode B mean | 308.6 | 20.5 | -93.4% |
| L6 extreme tokens (>p99.9) | 20 | 0 | -100% |
| L6 max | 11,475.4 | 147.3 | -98.7% |

Removing the gate removes the downstream behavior. That's the test we wanted — a functional circuit, not a correlation.
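The Change column is plain relative change against the baseline. The short check below recomputes it from the reported baseline and post-ablation values; it only verifies the arithmetic and is not the ablation itself. The helper name `pct_change` is ours.

```python
def pct_change(baseline, after):
    """Relative change of `after` versus `baseline`, in percent."""
    return 100.0 * (after - baseline) / baseline

# Baseline and post-ablation values as reported in the ablation table.
rows = {
    "L6 Mode B mean": (308.6, 20.5),
    "L6 extreme tokens (>p99.9)": (20, 0),
    "L6 max": (11475.4, 147.3),
}

for name, (base, after) in rows.items():
    print(f"{name}: {pct_change(base, after):+.1f}%")
```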

Three-stage pipeline (Qwen3-4B, 200K tokens)

| Stage | Layers | Behavior | Mode B % |
| --- | --- | --- | --- |
| Shallow triage | L0–L5 | Bimodal gate, selective routing | 5–17% |
| Deep explosion | L6–L15 | Std dev explodes 4,289%, tokens leave vocabulary space | 0.1% |
| Final routing | L16–L35 | Steady divergence, second bimodal gate at L35 | 0.2–10.9% |

The L5 compression dip

Every L6-extreme token follows the same trajectory: scores build through L0–L4, drop at L5, then explode 300×+ at L6.

Token ' unwanted': L3=43.4, L4=48.3, L5=31.2, L6=11,409.6
Token ' tops': L3=38.5, L4=45.1, L5=25.7, L6=10,678.6
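The build, dip, explode pattern can be written as a simple predicate over per-layer scores. The sketch below applies it to the two example tokens; the function name and the exact conditions (L4 above L3, L5 below L4, L6 at least 300 times L5) are our reading of the description, not a definition from the experiments.

```python
# Per-layer scores for the two example tokens quoted above.
trajectories = {
    " unwanted": {"L3": 43.4, "L4": 48.3, "L5": 31.2, "L6": 11409.6},
    " tops": {"L3": 38.5, "L4": 45.1, "L5": 25.7, "L6": 10678.6},
}

def dips_then_explodes(t, min_ratio=300.0):
    """True if scores build into L4, drop at L5, then jump by min_ratio at L6."""
    builds = t["L3"] < t["L4"]
    dips = t["L5"] < t["L4"]
    explodes = t["L6"] / t["L5"] >= min_ratio
    return builds and dips and explodes

for token, t in trajectories.items():
    ratio = t["L6"] / t["L5"]
    print(repr(token), f"L6/L5 = {ratio:.0f}x", dips_then_explodes(t))
```

Note that the 300×+ figure holds relative to the L5 value; measured against L4 the jump is smaller, though still over 200× for both tokens.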

It's a two-stage funnel: 100% of L6-extreme tokens were L3 Mode B, but only 1.6% of L3 Mode B tokens become L6-extreme. The first gate qualifies; the second commits.
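The two funnel percentages are conditional fractions over the same pair of boolean masks, read in opposite directions. A minimal sketch with synthetic masks (the mask sizes are made up to reproduce the 100% and 1.6% figures, and `funnel_stats` is a hypothetical helper):

```python
import numpy as np

def funnel_stats(l3_mode_b, l6_extreme):
    """Return (share of L6-extreme tokens that were L3 Mode B,
               share of L3 Mode B tokens that became L6-extreme)."""
    l3_mode_b = np.asarray(l3_mode_b, dtype=bool)
    l6_extreme = np.asarray(l6_extreme, dtype=bool)
    both = (l3_mode_b & l6_extreme).sum()
    return both / l6_extreme.sum(), both / l3_mode_b.sum()

# Synthetic example: 1,000 L3 Mode B tokens, 16 of which go extreme at L6.
l3 = np.zeros(10_000, dtype=bool)
l3[:1000] = True
l6 = np.zeros(10_000, dtype=bool)
l6[:16] = True

from_l3, to_l6 = funnel_stats(l3, l6)
print(f"L6-extreme that were L3 Mode B: {from_l3:.1%}")
print(f"L3 Mode B that became L6-extreme: {to_l6:.1%}")
```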

Where standard SAEs break

Sparse autoencoders, the dominant interpretability tool, completely fail on deep computation layers in these models. A pipeline of SipIt invertibility + a pre-trained SAE + GLP diffusion prior recovers what they miss.

| Layer | Role | SAE explained variance | Pipeline explained variance |
| --- | --- | --- | --- |
| L3 | Gate | 92.2% | 99.99% |
| L5 | Compression | 85.2% | 99.99% |
| L6 | Explosion | -905% | 100.0% |
| L8 | Post-explosion | -1,111% | 100.0% |
| L16 | Deep compute | -1,058% | 100.0% |
| L24 | Deepest | -3,059% | 100.0% |
| L35 | Final gate | 86.9% | 99.97% |

The takeaway we drew: deep computation layers use dense / distributed representations, not the sparse features SAEs are built to extract. That's a blind spot in the dominant interpretability approach, not a defect in our SAEs.
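Negative explained variance can look like a typo, so here is a minimal sketch of how it arises, assuming the usual definition of one minus residual sum of squares over total sum of squares around the mean. The post does not state the exact formula used; the toy arrays are illustrative only.

```python
import numpy as np

def explained_variance(x, x_hat):
    """1 - (squared reconstruction error / squared deviation from the mean)."""
    x = np.asarray(x, dtype=float)
    x_hat = np.asarray(x_hat, dtype=float)
    resid = np.sum((x - x_hat) ** 2)
    total = np.sum((x - x.mean(axis=0)) ** 2)
    return 1.0 - resid / total

x = np.array([[1.0], [2.0], [3.0]])   # toy "activations"
perfect = x.copy()                    # exact reconstruction
mean_only = np.full_like(x, 2.0)      # always predict the mean
reversed_recon = x[::-1]              # worse than predicting the mean

for name, x_hat in [("perfect", perfect), ("mean only", mean_only), ("reversed", reversed_recon)]:
    print(f"{name}: {explained_variance(x, x_hat):+.0%}")
```

Under this definition, any score below 0% means the reconstruction is worse than outputting the mean activation, and -3,059% corresponds to a residual roughly 31.6 times the variance of the activations themselves.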

Thinking-mode experiment

Qwen3's <think> token does not widen the gate. It amplifies depth.

| Layer | Baseline Mode B % | Think Mode B % | Baseline max | Think max |
| --- | --- | --- | --- | --- |
| L6 | 0.2% | 0.2% | 11,679 | 17,013 (+45.6%) |
| L35 | 8.9% | 8.1% | 3,896 | 7,272 (+86.7%) |

Emotion probe results

| Category | L6 mean | L6 max | vs. neutral |
| --- | --- | --- | --- |
| Emotion | 7,218 | 10,405 | +48% |
| High valence | 6,383 | 7,639 | +31% |
| Neutral | 4,874 | 5,438 | baseline |

The gate fires on the first token of the sentence. The model decides at onset whether deep processing is needed.