If the cluster-collapse pattern we documented in the Topology of Thought work were universal across all neural sequence models, state space models should show it too. They don't. Mamba-370m maintains fragmented representations from input to output, and the cluster count actually grows as you go deeper.
The test
Mamba-370m uses selective state space model (SSM) layers instead of attention. It has:
- No query-key-value projections
- No attention matrices
- No direct token-to-token interaction
- Only recurrent state evolution through selective scan
We ran the same persistent homology analysis (Vietoris-Rips filtration, computed with Ripser) on Mamba's hidden states that we had run on Qwen3-4B and NanoChat.
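To make "cluster count" concrete, here is a minimal sketch of the H0 part of that kind of analysis. It is not the Ripser pipeline itself: it swaps in a plain union-find, which gives the number of connected components of a Vietoris-Rips complex at one fixed scale `eps`. The function name `count_clusters`, the toy `hidden_states`, and the chosen `eps` values are all hypothetical, picked only for illustration.

```python
import math
import random


def count_clusters(points, eps):
    """Count connected components of the graph linking points within eps.

    At a fixed scale eps this matches the number of H0 classes alive in a
    Vietoris-Rips filtration (edges enter when distance <= eps).
    """
    parent = list(range(len(points)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= eps:
                parent[find(i)] = find(j)

    return len({find(i) for i in range(len(points))})


# Hypothetical stand-in for one layer's hidden states: two tight, distant blobs.
rng = random.Random(0)
blob_a = [(rng.gauss(0.0, 0.1), rng.gauss(0.0, 0.1)) for _ in range(20)]
blob_b = [(rng.gauss(5.0, 0.1), rng.gauss(5.0, 0.1)) for _ in range(20)]
hidden_states = blob_a + blob_b

print(count_clusters(hidden_states, eps=1.0))   # the two blobs stay separate
print(count_clusters(hidden_states, eps=10.0))  # everything merges into one
```

Running the same count on each layer's hidden states at a comparable scale is what turns "collapse" versus "proliferation" into a number you can track through depth.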
Results
| Metric | Qwen3-4B (attention) | NanoChat (attention) | Mamba-370m (SSM) |
|---|---|---|---|
| Start clusters | 519 | 517 | 551 |
| End clusters | 1 | 1 | 987 |
| Direction | Collapse | Collapse | Proliferation |
| Intrinsic dimension (ID) profile | Inverted U (peak 9.8) | Inverted U (peak 12.1) | Flat (~3) |
| H1 loop peak | 342 | 400 | None |
Mamba's cluster count increases through the network. There's no integration layer, no unified manifold, no topological phase transition.
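For readers unfamiliar with the ID row, the sketch below shows one common way to estimate intrinsic dimension from a point cloud, assuming "ID" in the table means intrinsic dimension. The table does not say which estimator was used, so the TwoNN estimator here is an assumption, and `twonn_dimension` and the toy data are hypothetical names chosen for illustration.

```python
import math
import random


def twonn_dimension(points):
    """Maximum-likelihood TwoNN estimate: N / sum(log(r2 / r1)).

    r1 and r2 are each point's distances to its first and second
    nearest neighbors.
    """
    log_ratios = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        r1, r2 = dists[0], dists[1]
        if r1 > 0:  # skip exact duplicates, where the ratio is undefined
            log_ratios.append(math.log(r2 / r1))
    return len(log_ratios) / sum(log_ratios)


rng = random.Random(0)
# A straight line embedded in 3-D: ambient dimension 3, intrinsic dimension 1.
line = [(t, 2.0 * t, -t) for t in (rng.random() for _ in range(300))]
# Points filling a 3-D cube: intrinsic dimension 3.
cube = [(rng.random(), rng.random(), rng.random()) for _ in range(300)]

print(round(twonn_dimension(line), 2))  # close to 1
print(round(twonn_dimension(cube), 2))  # close to 3
```

Applied layer by layer, an estimator like this yields a profile over depth, which is the kind of curve the "Inverted U" and "Flat" entries describe.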
A three-condition framework
Combined with an untrained (randomly initialized) transformer as a control, the Mamba result suggests two conditions for topological integration:
| Condition | Trained transformer | Mamba (SSM) | Untrained transformer |
|---|---|---|---|
| Direct interaction (attention) | Yes | No | Yes |
| Optimization (gradient descent) | Yes | Yes | No |
| Topological collapse | Yes | No | No |
Both conditions have to be present:
- Attention without training (random init) → clusters proliferate (503 → 968)
- Training without attention (Mamba) → clusters proliferate (551 → 987)
- Attention with training → clusters collapse (519 → 1 in Qwen3-4B, 517 → 1 in NanoChat)
Why attention seems to be the mechanism
Attention creates direct token-to-token interaction. Every token can influence every other token's representation in a single layer. SSM recurrence processes tokens sequentially through a state vector, which acts as a bottleneck. The direct many-to-many interaction that attention provides looks like the thing that enables representational integration into a single manifold.
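The contrast can be illustrated with a toy sketch. It is deliberately simplified and is not Mamba's selective scan: the recurrence is scalar, `h_t = decays[t] * h_{t-1} + x_t`, with no input or output projections, and the attention layer is a single causal softmax over hand-picked scores. The function names and all numbers are hypothetical.

```python
import math


def causal_attention_weights(scores):
    """Softmax each row t over the visible positions s <= t."""
    weights = []
    for t, row in enumerate(scores):
        visible = row[: t + 1]
        m = max(visible)
        exps = [math.exp(v - m) for v in visible]
        total = sum(exps)
        weights.append([e / total for e in exps] + [0.0] * (len(row) - t - 1))
    return weights


def recurrence_influence(decays):
    """Influence of x_s on h_t for h_t = decays[t] * h_{t-1} + x_t.

    The entry [t][s] is the product of decays[s+1..t]: token s reaches
    position t only by surviving every state update in between.
    """
    n = len(decays)
    influence = [[0.0] * n for _ in range(n)]
    for s in range(n):
        influence[s][s] = 1.0
        for t in range(s + 1, n):
            influence[t][s] = influence[t - 1][s] * decays[t]
    return influence


# Attention: position 3 is given a high score for token 0; position 2 is not.
scores = [[0.0] * 4 for _ in range(4)]
scores[3][0] = 5.0
attn = causal_attention_weights(scores)

# Recurrence: each step keeps half of the previous state.
infl = recurrence_influence([0.5, 0.5, 0.5, 0.5])

print(attn[2][0], attn[3][0])  # token 0's weight is set per position
print(infl[2][0], infl[3][0])  # token 0's influence only shrinks with distance
```

In the attention case, how much position 3 draws on token 0 is chosen independently of what position 2 does. In this scalar recurrence, token 0's influence at position 3 is its influence at position 2 multiplied by one more decay factor, so everything a later position sees of the past has to pass through the state.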
What we drew from this
- Architecture matters for integration. Not all sequence models develop unified representations.
- Attention isn't just about parallelism. It provides a qualitatively different kind of computation than sequential state evolution.
- SSMs may make a different computational tradeoff: efficient sequence processing without representational integration.
- Topology discriminates architectures. The same measurement cleanly separates models that integrate from those that don't.