Emergent Computational Gating in Dense Transformers

We didn't go looking for a routing mechanism in dense transformer models, because the architecture doesn't contain one. But in Mistral-7B, Mistral-Small-24B, and Qwen3-4B, every token is routed into one of two processing paths at layers 3-4, and the routing is consistent enough across models that it looks like a real circuit rather than a statistical artifact.

The core finding

The two paths differ sharply in how much they transform a token:

Mode A (93–95% of tokens): shallow processing, minimal transformation
Mode B (5–7%): deep processing, massive representational shift

There's no mixture-of-experts routing. There's no explicit gating layer. The model develops this routing during training, on its own.
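One way to make the two modes concrete is to measure how far each token's hidden state moves between two adjacent layers and flag the tokens that move far more than the rest. The sketch below assumes that criterion. The Euclidean distance, the function names, and the 6% cutoff are illustrative assumptions, not the procedure that produced the numbers in this post.

```python
import math

def layer_shift(h_prev, h_next):
    """Euclidean distance between one token's hidden state at two layers."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(h_prev, h_next)))

def label_modes(shifts, mode_b_fraction=0.06):
    """Label the top `mode_b_fraction` of tokens by shift as Mode B.

    The fraction is an assumed cutoff chosen to sit inside the 5-7% range
    quoted above; it is not a learned or published threshold.
    """
    n_b = max(1, round(len(shifts) * mode_b_fraction))
    cutoff = sorted(shifts, reverse=True)[n_b - 1]
    return ["B" if s >= cutoff else "A" for s in shifts]

# Toy data: 50 tokens, three of which move far more than the others.
shifts = [1.0 + 0.01 * i for i in range(47)] + [40.0, 55.0, 48.0]
labels = label_modes(shifts)
print(labels.count("B"), "of", len(labels), "tokens labeled Mode B")
```

With real activations, `shifts` would come from per-token hidden states at the two layers around the gate. A fixed fraction is the simplest possible rule; a threshold fitted to the gap between the two modes would be less arbitrary.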

Cross-architecture confirmation

Model	Family	Parameters	Gate layer	Mode B share (unit as reported per model)
Mistral-7B	Mistral	7B	L3-4	~50% of outlier tokens
Mistral-Small-24B	Mistral	24B	L3-4	47.6% of responses
Qwen3-4B	Qwen	4B	L3-4	7.27% of tokens

Statistical evidence

Model	Evidence
Mistral-7B	L2-L3 avg distance +2,170%, std dev +17,750%, avg/median ratio 17.9×
Mistral-Small-24B	Cohen's d = 4.8, Silhouette = 0.83, AUC-ROC = 0.97, p < 0.001
Qwen3-4B	Feature 374 correlation 0.96 across L3–L5, bimodal ratio 1.09
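For readers unfamiliar with two of the statistics in this table, here is a minimal sketch of how Cohen's d and a mean/median ratio are typically computed. It assumes the pooled-standard-deviation form of Cohen's d, which may not be the variant used for the figure above, and the input scores are toy values rather than model measurements.

```python
import math
import statistics

def cohens_d(group_a, group_b):
    """Cohen's d using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance (n - 1)
    var_b = statistics.variance(group_b)
    pooled = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (statistics.mean(group_b) - statistics.mean(group_a)) / pooled

def mean_median_ratio(values):
    """Mean divided by median; large values indicate a heavy upper tail."""
    return statistics.mean(values) / statistics.median(values)

# Toy per-token scores, not measurements from any of the models above.
mode_a = [1.0, 2.0, 3.0, 4.0, 5.0]
mode_b = [11.0, 12.0, 13.0, 14.0, 15.0]
d = cohens_d(mode_a, mode_b)

# A single extreme value pulls the mean far above the median.
ratio = mean_median_ratio(mode_a + [1000.0])
print(f"d = {d:.2f}, mean/median = {ratio:.1f}")
```

A mean far above the median is what a small population of extreme tokens sitting on top of a large, quiet majority would produce.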

Causal proof

Ablating the gate at Layer 3 in Qwen3-4B:

Metric	Baseline	After ablation	Change
L6 Mode B mean	308.6	20.5	-93.4%
L6 extreme tokens (>p99.9)	20	0	-100%
L6 max	11,475.4	147.3	-98.7%

Removing the gate removes the downstream behavior. That's the test we wanted: evidence of a functional circuit, not just a correlation.
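The Change column is a plain relative change against baseline. The short check below recomputes it from the Baseline and After ablation values in the table; the function and variable names are ours, chosen for illustration.

```python
def pct_change(baseline, ablated):
    """Relative change from baseline, in percent."""
    return (ablated - baseline) / baseline * 100.0

# (baseline, after ablation) pairs copied from the ablation table.
rows = {
    "L6 Mode B mean": (308.6, 20.5),
    "L6 extreme tokens (>p99.9)": (20, 0),
    "L6 max": (11475.4, 147.3),
}

changes = {name: round(pct_change(b, a), 1) for name, (b, a) in rows.items()}
for name, change in changes.items():
    print(f"{name}: {change:+.1f}%")
```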

Three-stage pipeline (Qwen3-4B, 200K tokens)

Stage	Layers	Behavior	Mode B %
Shallow triage	L0–L5	Bimodal gate, selective routing	5–17%
Deep explosion	L6–L15	Std dev explodes 4,289%, tokens leave vocabulary space	0.1%
Final routing	L16–L35	Steady divergence, second bimodal gate at L35	0.2–10.9%

The L5 compression dip

Every L6-extreme token follows the same trajectory: scores build through L0–L4, drop at L5, then explode 300×+ at L6.

Token ' unwanted': L3=43.4, L4=48.3, L5=31.2, L6=11,409.6
Token ' tops': L3=38.5, L4=45.1, L5=25.7, L6=10,678.6

It's a two-stage funnel: 100% of L6-extreme tokens were in Mode B at L3, but only 1.6% of L3 Mode B tokens go on to become L6-extreme. The first gate qualifies; the second commits.
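The build, dip, explode pattern can be written as a simple check on per-layer scores. The sketch below applies it to the two example tokens above. The function name and the 300× default threshold are illustrative choices based on the description in this section, not part of the original analysis code.

```python
# Scores copied from the two example tokens above, keyed by layer index.
trajectories = {
    " unwanted": {3: 43.4, 4: 48.3, 5: 31.2, 6: 11409.6},
    " tops": {3: 38.5, 4: 45.1, 5: 25.7, 6: 10678.6},
}

def has_dip_then_explosion(scores, min_ratio=300.0):
    """True if the score rises L3->L4, drops at L5, then jumps at L6."""
    builds = scores[4] > scores[3]
    dips = scores[5] < scores[4]
    explodes = scores[6] / scores[5] >= min_ratio
    return builds and dips and explodes

# L6 / L5 ratio per token: how large the jump is after the dip.
ratios = {tok: s[6] / s[5] for tok, s in trajectories.items()}

for tok, s in trajectories.items():
    print(repr(tok), has_dip_then_explosion(s), f"{ratios[tok]:.1f}x")
```

Both example tokens pass the check, with L6/L5 ratios of roughly 366× and 416×.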

Where standard SAEs break

Sparse autoencoders (SAEs), the dominant interpretability tool, fail on the deep computation layers: explained variance goes strongly negative from L6 through L24. A pipeline combining SipIt invertibility, a pre-trained SAE, and a GLP diffusion prior recovers what they miss.

Layer	Role	SAE explained variance	Pipeline explained variance
L3	Gate	92.2%	99.99%
L5	Compression	85.2%	99.99%
L6	Explosion	-905%	100.0%
L8	Post-explosion	-1,111%	100.0%
L16	Deep compute	-1,058%	100.0%
L24	Deepest	-3,059%	100.0%
L35	Final gate	86.9%	99.97%

The takeaway we drew: deep computation layers use dense / distributed representations, not the sparse features SAEs are built to extract. That's a blind spot in the dominant interpretability approach, not a defect in our SAEs.
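Negative explained variance can look like a typo, so a short illustration may help. The sketch assumes the common definition, 1 - SSE/SST, because the exact formula isn't stated here. Under that definition the value drops below zero whenever the reconstruction error exceeds the variance of the activations themselves, meaning the reconstruction is worse than simply predicting the mean. All values are toy numbers.

```python
def explained_variance(x, x_hat):
    """1 - SSE / SST over flattened activations (can be negative)."""
    mean_x = sum(x) / len(x)
    sse = sum((a - b) ** 2 for a, b in zip(x, x_hat))  # reconstruction error
    sst = sum((a - mean_x) ** 2 for a in x)            # variance around the mean
    return 1.0 - sse / sst

x = [1.0, 2.0, 3.0, 4.0]          # toy activations
good = [1.1, 1.9, 3.2, 3.8]       # close reconstruction
bad = [10.0, -8.0, 12.0, -6.0]    # reconstruction worse than the mean

ev_good = explained_variance(x, good)
ev_bad = explained_variance(x, bad)
print(f"good: {ev_good:.0%}, bad: {ev_bad:.0%}")
```

Read this way, a figure like -905% would mean the squared reconstruction error is about ten times the variance of the activations.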

Thinking-mode experiment

Qwen3's <think> token does not widen the gate. It amplifies depth.

Layer	Baseline Mode B %	Think Mode B %	Baseline max	Think max
L6	0.2%	0.2%	11,679	17,013 (+45.6%)
L35	8.9%	8.1%	3,896	7,272 (+86.7%)

Emotion probe results

Category	L6 mean	L6 max	L6 mean vs. neutral
Emotion	7,218	10,405	+48%
High valence	6,383	7,639	+31%
Neutral	4,874	5,438	baseline

The gate fires on the first token of the sentence. The model decides at onset whether deep processing is needed.