Attention Zoo: Summary of SSMs and Transformers

An interactive guide to modern sequence models: explore architectures and recurrences across the linear-softmax landscape.


[Figure: Attention Zoo teaser]

Intro

When I first started learning about SSMs and linear transformers, I was overwhelmed. From simple linear attention models to Mamba, Kimi Linear, and many others, it was hard to understand how everything connected and where to even begin.

This blog is the guide I wish I had. Its goal is to provide an intuitive overview of linear transformers, SSMs, and the ideas behind modern hybrid architectures without diving too deeply into the math. If you want the technical details, I’ve linked the relevant papers throughout the blog ;)!

I hope this helps make the rapidly evolving world of SSMs and linear models easier to navigate. If there’s a model you’d like to see added or you have suggestions, feel free to reach out — I’ll keep updating this guide as the field evolves.

Modern sequence models share a deep mathematical skeleton: a key-value memory written to at each step and read by a query. The differences lie in how that memory decays, and this page makes those differences interactive and visual.
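To make that shared skeleton concrete, here is a minimal NumPy sketch of a key-value memory that is written at each step and read by a query. The function name `kv_memory_readout` and the single scalar `decay` are assumptions chosen for illustration. The models in the Zoo use richer decays and updates, so treat this as a toy, not as any particular model's recurrence.

```python
import numpy as np

def kv_memory_readout(Q, K, V, decay=1.0):
    """Run a toy decayed key-value memory over a sequence.

    Q, K, V: arrays of shape (T, d). Returns outputs of shape (T, d).
    `decay` is a hypothetical scalar in [0, 1]; 1.0 means no forgetting.
    """
    T, d = Q.shape
    S = np.zeros((d, d))                      # memory state, one d x d matrix
    outputs = np.zeros((T, d))
    for t in range(T):
        # Write: shrink the old memory, then add the outer product k_t v_t^T.
        S = decay * S + np.outer(K[t], V[t])
        # Read: project the memory with the query, o_t = S_t^T q_t.
        outputs[t] = Q[t] @ S
    return outputs

# Usage: random toy inputs with T = 4 timesteps and d = 8 channels.
rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, 4, 8))
out = kv_memory_readout(Q, K, V, decay=0.9)
print(out.shape)  # (4, 8)
```

With `decay=1.0` every past write stays in memory at full strength, while `decay<1.0` makes older writes fade geometrically. Changing only that one line of the loop is, roughly, what separates many of the models below.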

Filter by attention kernel and memory decay type below. Each model card shows the architecture diagram for that model.

📓 Notation

Throughout this post, vectors are lowercase and matrices are uppercase. The query, key, and value vectors at a single timestep and their full-sequence stacks are:

\[q_t,\, k_t,\, v_t \in \mathbb{R}^d \qquad Q,\, K,\, V \in \mathbb{R}^{T \times d}\]

The attention weight matrix and the SSM hidden state are:

\[A \in \mathbb{R}^{T \times T} \qquad S_t \in \mathbb{R}^{d \times d}\]
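
As a rough guide to how these two objects relate, here is one common convention (not necessarily the exact form used by every model in the Zoo). The causal mask \(M\) (zero on and below the diagonal, \(-\infty\) above it), the outputs \(O\) and \(o_t\), and the initial state \(S_0 = 0\) are notation introduced here for illustration. Softmax attention materializes \(A\) explicitly:

\[A = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d}} + M\right) \qquad O = AV\]

The simplest linear model instead folds the same key-value products into the running state:

\[S_t = S_{t-1} + k_t v_t^\top \qquad o_t = S_t^\top q_t = \sum_{s \le t} (q_t^\top k_s)\, v_s\]

The last equality shows that reading the state is equivalent to using unnormalized causal weights \(A_{ts} = q_t^\top k_s\) for \(s \le t\), which is why the state needs only \(d \times d\) memory rather than \(T \times T\).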

The Evolution of Attention Mechanisms

[Interactive timeline: 20 models, 2020–2026, grouped into Softmax and Linear families. Solid lines = direct lineage; dashed lines = architectural influence.]

The ZOO

[Interactive model browser. Filters: Readout · Decay Type (exact match, selects unique models).]


📬 Final Note

If you feel that some linear or softmax models are missing from the Zoo, feel free to ping me and I will add them. A sample architecture block template is available: create your model’s block in the same style and send it over. You can reach me on Twitter/X at @rshia_afz; DMs are open 😉.

Also, the same recurrences and rollouts can be applied to the residual stream, resulting in Deep Delta Learning, gating, and attention residuals. Stay tuned: there will soon be another post, or an update to this one, covering upgrades to the residual stream as well.