The Mamba Counterexample | Light of Baldr

If the cluster-collapse pattern we documented in the Topology of Thought work were universal across all neural sequence models, state space models should show it too. They don't. Mamba-370m maintains fragmented representations from input to output, and the cluster count actually grows as you go deeper.

The test

Mamba-370m uses selective state space model (SSM) layers instead of attention. It has:

No query-key-value projections
No attention matrices
No direct token-to-token interaction
Only recurrent state evolution through selective scan
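To make "recurrent state evolution through selective scan" concrete, here is a minimal one-dimensional sketch. It is only loosely modelled on a selective SSM and is not Mamba's actual parameterization: the function name, the scalar weights `w_delta` and `a`, and the simplified update rule are all assumptions chosen for illustration. The point it shows is that each new state depends on earlier tokens only through the previous state.

```python
import math


def softplus(z):
    """Smooth positive function, used here to keep the step size > 0."""
    return math.log1p(math.exp(z))


def selective_scan(xs, w_delta=1.0, a=-1.0):
    """Toy 1-D selective recurrence (illustrative, not Mamba's real layer).

    h_t = exp(delta_t * a) * h_{t-1} + delta_t * x_t
    where delta_t = softplus(w_delta * x_t) depends on the current input.
    w_delta and a are hypothetical scalar parameters.
    """
    h = 0.0
    states = []
    for x in xs:
        delta = softplus(w_delta * x)            # input-dependent step size
        h = math.exp(delta * a) * h + delta * x  # decay old state, write new input
        states.append(h)
    return states


print(selective_scan([1.0, 2.0, 3.0]))
```

Because the loop carries a single running state, token 1 can reach token 3 only by surviving the decay applied at step 2 and step 3. There is no term that links two positions directly.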

We ran the same persistent homology analysis (Vietoris-Rips via Ripser) on Mamba's hidden states that we had run on Qwen3-4B and NanoChat.
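As a rough illustration of what a "cluster count" means in this setting, the sketch below counts connected components of a point cloud at one fixed distance scale, which corresponds to the H0 classes alive at that scale in a Vietoris-Rips filtration. It is a dependency-free stand-in, not the Ripser pipeline used for the results here. The function name, the toy points, and the choice of a single threshold `eps` are assumptions, since the exact procedure for turning persistence diagrams into cluster counts is not described in this post.

```python
import math
from itertools import combinations


def count_clusters(points, eps):
    """Count connected components of the eps-neighbourhood graph.

    Two points are joined when their Euclidean distance is <= eps, so the
    result is the number of H0 classes alive at scale eps in a
    Vietoris-Rips filtration built on the same distances.
    """
    parent = list(range(len(points)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(points)), 2):
        if math.dist(points[i], points[j]) <= eps:
            ri, rj = find(i), find(j)
            if ri != rj:
                parent[ri] = rj  # merge the two components

    return len({find(i) for i in range(len(points))})


# Toy "hidden states": two tight groups plus one isolated point.
toy_states = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
              (5.0, 5.0), (5.1, 5.0),
              (10.0, 0.0)]

print(count_clusters(toy_states, eps=0.5))   # 3
print(count_clusters(toy_states, eps=20.0))  # 1
```

Applied layer by layer to real hidden states, a count that falls toward 1 would correspond to collapse, and a count that rises would correspond to proliferation.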

Results

Metric	Qwen3-4B (attention)	NanoChat (attention)	Mamba-370m (SSM)
Start clusters	519	517	551
End clusters	1	1	987
Direction	Collapse	Collapse	Proliferation
Intrinsic dimension (ID) profile	Inverted U (peak 9.8)	Inverted U (peak 12.1)	Flat (~3)
H1 loop peak	342	400	None

Mamba's cluster count increases through the network, from 551 to 987. There's no integration layer, no unified manifold, no topological phase transition.

A three-condition framework

Combined with an untrained-transformer control, the Mamba result gives a clean causal picture of what produces topological integration:

Condition	Trained transformer	Mamba (SSM)	Untrained transformer
Direct interaction (attention)	Yes	No	Yes
Optimization (gradient descent)	Yes	Yes	No
Topological collapse	Yes	No	No

Collapse appears only when both attention and training are present:

Attention without training (random init) → clusters proliferate (503 → 968)
Training without attention (Mamba) → clusters proliferate (551 → 987)
Attention with training → clusters collapse (517 → 1)
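The comparison can be checked mechanically against the numbers reported in this post. The sketch below encodes the four runs, classifies each as collapse or proliferation from its start and end cluster counts, and compares that with the rule "collapse if and only if attention and training are both present". The variable and function names are illustrative choices; the cluster counts are the ones quoted above.

```python
runs = [
    # (name, has_attention, is_trained, start_clusters, end_clusters)
    ("Qwen3-4B", True, True, 519, 1),
    ("NanoChat", True, True, 517, 1),
    ("Mamba-370m", False, True, 551, 987),
    ("Untrained transformer", True, False, 503, 968),
]


def direction(start, end):
    """Label a run by whether its cluster count shrank or grew."""
    return "collapse" if end < start else "proliferation"


def predicted(has_attention, is_trained):
    """The rule proposed in the text: both ingredients are required."""
    return "collapse" if (has_attention and is_trained) else "proliferation"


for name, attn, trained, start, end in runs:
    print(f"{name}: {start} -> {end} ({end / start:.2f}x), "
          f"observed {direction(start, end)}, "
          f"rule predicts {predicted(attn, trained)}")
```

The rule matches all four runs, with the two proliferating models growing by roughly 1.8x and 1.9x. The fourth cell of the grid, a model with neither attention nor training, is not among the reported runs.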

Why attention seems to be the mechanism

Attention creates direct token-to-token interaction: within a single layer, a token's representation can draw on every token it attends to (in a causal decoder, every earlier position). SSM recurrence instead passes information sequentially through a fixed-size state, which acts as a bottleneck. The direct many-to-many interaction that attention provides looks like the thing that enables representational integration into a single manifold.
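A toy version of the attention side of this contrast is sketched below. It is a single-head causal self-attention over scalar tokens, with queries, keys, and values all set equal to the input. That is a simplifying assumption, since a real layer uses learned projections and vector-valued tokens, and the function name is hypothetical. What it shows is the explicit matrix of pairwise weights that a recurrence does not have.

```python
import math


def causal_self_attention(xs):
    """Single-head causal attention over scalar tokens with q = k = v = x.

    Returns (outputs, weights), where weights[t][s] is the direct
    contribution of token s to the output at position t.
    """
    outputs, weights = [], []
    for t, q in enumerate(xs):
        scores = [q * k for k in xs[: t + 1]]      # only positions <= t
        m = max(scores)                            # subtract max for stability
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        # Softmax weights for visible positions, zeros for future ones.
        row = [e / z for e in exps] + [0.0] * (len(xs) - t - 1)
        weights.append(row)
        outputs.append(sum(w * v for w, v in zip(row, xs)))
    return outputs, weights


outputs, weights = causal_self_attention([1.0, 2.0, 3.0])
for row in weights:
    print([round(w, 3) for w in row])
```

Every entry on or below the diagonal is strictly positive, so the last token reads from the first one directly, in one step, rather than through a chain of intermediate states.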

What we drew from this

Architecture matters for integration. Not all sequence models develop unified representations.
Attention isn't just about parallelism. It provides a qualitatively different kind of computation than sequential state evolution.
SSMs may make a different computational tradeoff: efficient sequence processing without representational integration.
Topology discriminates architectures. The same measurement cleanly separates models that integrate from those that don't.