
The Mamba Counterexample

If topological integration were universal, state space models should show it too. They don’t. Mamba-370m maintains fragmented representations end-to-end.

If the cluster-collapse pattern we documented in the Topology of Thought work were universal across all neural sequence models, state space models should show it too. They don't. Mamba-370m maintains fragmented representations from input to output, and the cluster count actually grows with depth, from 551 at the start of the network to 987 at the end.

The test

Mamba-370m uses selective state space model (SSM) layers instead of attention. It has:

  • No query-key-value projections
  • No attention matrices
  • No direct token-to-token interaction
  • Only recurrent state evolution through selective scan

We ran the same persistent homology analysis (Vietoris-Rips via Ripser) on Mamba's hidden states as we'd run on Qwen3-4B and NanoChat.
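To make the cluster counts below concrete, here is a minimal sketch of the quantity being tracked. It is not our Ripser pipeline: it is a standard-library stand-in that counts the connected components (H0) of the Vietoris-Rips complex at one fixed scale using union-find. The name `count_clusters`, the `epsilon` values, and the toy 2-D points are assumptions made for illustration only.

```python
import math
from itertools import combinations

def count_clusters(points, epsilon):
    """Count connected components (H0) of the Vietoris-Rips complex at scale epsilon.

    Two points are linked when their Euclidean distance is <= epsilon.
    Illustrative stand-in for a Ripser run, not the pipeline used in the post.
    """
    parent = list(range(len(points)))

    def find(i):
        # Follow parent pointers to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(points)), 2):
        if math.dist(points[i], points[j]) <= epsilon:
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                parent[root_i] = root_j  # merge the two components

    return len({find(i) for i in range(len(points))})

# Toy stand-in for one layer's hidden states: two well-separated groups.
layer_states = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]

clusters_fine = count_clusters(layer_states, epsilon=0.5)     # groups stay separate
clusters_coarse = count_clusters(layer_states, epsilon=10.0)  # everything merges
print(clusters_fine, clusters_coarse)
```

Running the same count on each layer's hidden states, at a fixed scale, gives a per-layer cluster curve. A collapse shows up as that curve falling toward 1, and proliferation as the curve rising.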

Results

| Metric | Qwen3-4B (attention) | NanoChat (attention) | Mamba-370m (SSM) |
|---|---|---|---|
| Start clusters | 519 | 517 | 551 |
| End clusters | 1 | 1 | 987 |
| Direction | Collapse | Collapse | Proliferation |
| ID profile | Inverted U (peak 9.8) | Inverted U (peak 12.1) | Flat (~3) |
| H1 loop peak | 342 | 400 | None |

Mamba's cluster count increases through the network. There's no integration layer, no unified manifold, no topological phase transition.

A three-condition framework

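Combined with an untrained-transformer control, the Mamba result gives a clean picture of what produces topological integration. Two ingredients are varied across three configurations: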

| Condition | Trained transformer | Mamba (SSM) | Untrained transformer |
|---|---|---|---|
| Direct interaction (attention) | Yes | No | Yes |
| Optimization (gradient descent) | Yes | Yes | No |
| Topological collapse | Yes | No | No |

Both conditions have to be present:

  • Attention without training (random init) → clusters proliferate (503 → 968)
  • Training without attention (Mamba) → clusters proliferate (551 → 987)
  • Attention with training → clusters collapse (517 → 1)
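The three cases above amount to a simple rule: collapse is expected only when attention and training are both present. The sketch below encodes that rule and checks it against the cluster counts reported in this post. The helper names are illustrative, and the fourth cell (no attention, no training) is absent because no such run is reported here.

```python
# Cluster counts (start of network -> end of network) as reported in this post.
runs = [
    {"model": "Qwen3-4B", "attention": True, "trained": True, "start": 519, "end": 1},
    {"model": "NanoChat", "attention": True, "trained": True, "start": 517, "end": 1},
    {"model": "Untrained transformer", "attention": True, "trained": False, "start": 503, "end": 968},
    {"model": "Mamba-370m", "attention": False, "trained": True, "start": 551, "end": 987},
]

def observed_direction(start, end):
    """Classify a run by how its cluster count changes through the network."""
    if end < start:
        return "collapse"
    if end > start:
        return "proliferation"
    return "flat"

def predicted_direction(attention, trained):
    """The rule suggested by the data: collapse needs both ingredients."""
    return "collapse" if (attention and trained) else "proliferation"

results = []
for run in runs:
    observed = observed_direction(run["start"], run["end"])
    predicted = predicted_direction(run["attention"], run["trained"])
    results.append((run["model"], observed, predicted))
    print(f'{run["model"]}: observed={observed}, predicted={predicted}')
```

With only four runs this is a consistency check, not a proof. It shows that the rule fits every configuration measured so far.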

Why attention seems to be the mechanism

Attention creates direct token-to-token interaction. Within a single layer, each token can read directly from every token it is allowed to attend to (in a causal decoder, every earlier token). SSM recurrence instead processes tokens sequentially through a fixed-size state vector, which acts as a bottleneck: earlier tokens reach later ones only through whatever the state has retained. The direct many-to-many interaction that attention provides looks like the thing that enables representational integration into a single manifold.
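A toy contrast makes the bottleneck concrete. This is a deliberately simplified sketch, not either architecture: tokens are scalars, `attention_mix` uses identity query/key/value maps with causal masking, and `recurrent_scan` uses a fixed `decay` in place of Mamba's input-dependent selective scan. All names and values are assumptions chosen for illustration. Two different prefixes that happen to produce the same recurrent state become indistinguishable to the scan, while attention still reads the individual tokens.

```python
import math

def attention_mix(tokens):
    """One causal self-attention step on scalar tokens (identity Q/K/V, illustrative only)."""
    outputs = []
    for t, query in enumerate(tokens):
        visible = tokens[: t + 1]                 # causal mask: current and earlier tokens
        scores = [query * key for key in visible]
        top = max(scores)                         # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append(sum(w * v for w, v in zip(weights, visible)) / total)
    return outputs

def recurrent_scan(tokens, decay=0.5):
    """Linear recurrence h_t = decay * h_{t-1} + x_t with a fixed decay (not a selective scan)."""
    state = 0.0
    outputs = []
    for x in tokens:
        state = decay * state + x                 # the state is the only path to the past
        outputs.append(state)
    return outputs

# Two different prefixes that leave the recurrence in the same state (h = 1.0),
# followed by the same final token.
seq_a = [2.0, 0.0, 3.0]
seq_b = [0.0, 1.0, 3.0]

scan_a, scan_b = recurrent_scan(seq_a)[-1], recurrent_scan(seq_b)[-1]
attn_a, attn_b = attention_mix(seq_a)[-1], attention_mix(seq_b)[-1]
print("scan:", scan_a, scan_b)
print("attention:", attn_a, attn_b)
```

The scan's final outputs are identical for the two sequences, because everything about the prefix has been compressed into one state value. The attention outputs differ, because the last token scores each earlier token directly. A real SSM has a much larger, input-dependent state, so the effect is far less extreme than in this scalar toy.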

What we drew from this

  1. Architecture matters for integration. Not all sequence models develop unified representations.
  2. Attention isn't just about parallelism. It provides a qualitatively different kind of computation than sequential state evolution.
  3. SSMs may have different computational tradeoffs: efficient sequence processing without representational integration.
  4. Topology discriminates architectures. The same measurement cleanly separates models that integrate from those that don't.