Memory as a Set of Slots
🐦‍⬛ Ravens are among the most intelligent birds known, famous for caching food across hundreds of locations and retrieving each item with remarkable precision. Just like our new model, which stores information selectively and can retrieve it cleanly when needed!
Consider Needle-In-The-Haystack (NIAH). The model is first given a lookup key, then a long dictionary, and finally asked to retrieve the matching value. This is a form of constant-memory recall: in principle, it requires only $O(1)$ memory, since the model only needs to retain the key and its corresponding value as it scans the sequence (unlike scaling-memory recall tasks like KV-Retrieval, which require $O(T)$ memory because the query comes after the dictionary).
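To make the task concrete, here is a toy sketch of an NIAH instance together with the $O(1)$-memory reference solver the text describes. The prompt layout, vocabulary, and function names are illustrative assumptions, not the benchmark's actual format:

```python
import random

def make_niah_example(vocab_size=100, dict_len=50, seed=0):
    """Build a toy NIAH prompt: the lookup key first, then a long
    key->value dictionary. The answer is the value paired with the key."""
    rng = random.Random(seed)
    keys = rng.sample(range(vocab_size), dict_len)     # distinct keys
    values = [rng.randrange(vocab_size) for _ in range(dict_len)]
    target_key = rng.choice(keys)
    # Prompt layout (an assumption): [target_key] [k1 v1 k2 v2 ...]
    prompt = [target_key] + [t for kv in zip(keys, values) for t in kv]
    answer = values[keys.index(target_key)]
    return prompt, answer

def scan_recall(prompt):
    """O(1)-memory solver: remember only the key; emit its value on match."""
    key, stream = prompt[0], prompt[1:]
    for k, v in zip(stream[0::2], stream[1::2]):
        if k == key:
            return v

prompt, answer = make_niah_example()
```

The scanner holds just one key and one value at any time, which is exactly why NIAH is, in principle, solvable with constant memory.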
Despite this theoretical $O(1)$ requirement, recurrent models still lag behind Transformers on NIAH. This failure isn't just a quirk of synthetic benchmarks; it mirrors real-world bottlenecks. In multiple-choice questions, a recurrent model may identify the correct answer early on but fail to preserve that single letter in its state until the output step.
The problem is not a lack of space: the model has enough capacity to store the information. Rather, it lacks the structure needed to organize its state, so new data interferes with what is already stored instead of leaving it intact.
Can we design a recurrent model with strong constant-memory recall that approaches theoretical limits?
To build intuition, think of the hidden state as a closet. Good organization is not just about how much space you have, but about how that space is structured. When a new item arrives, you usually do two things:
1. Choose where it should go. (Routing)
2. Clear just enough space there to fit it. (Squeezing / Decay)
Memory in recurrent models faces the same dual requirement. A model should decide which part of its state should store the new information, and it should update that part without unnecessarily altering the rest. This motivates the idea of slots: subsets of the hidden state that can be updated independently. With this structure, a new token can modify only the relevant slot(s), while other stored information remains undisturbed.
State space models (SSMs)
SSMs are strong at the reorganization step. The forget gate $A_t$ decays old content to make room for new information. But they do not control where information is written.
In the diagonal case ($A_t=\text{diag}(a_t)$), the update simplifies to $S_t = S_{t-1} \odot a_t + v_t k_t^\top$. Here, the decay $a_t$ is applied across the entire state rather than to any chosen part of it. And because the write $v_t k_t^\top$ lands across all $M$ rows, every new token perturbs everything already stored. There is little isolation between memory contents, so interference accumulates.
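A minimal NumPy sketch of this update makes the interference visible. The state size, number of steps, and decay range are arbitrary toy choices:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 8, 4                  # toy state S of shape M x N

def ssm_step(S, a_t, k_t, v_t):
    """Diagonal-SSM update S_t = S_{t-1} * a_t + v_t k_t^T.
    Both the decay and the rank-1 write touch every row of the state."""
    return S * a_t + np.outer(v_t, k_t)

S = np.zeros((M, N))
for _ in range(5):                               # five arbitrary tokens
    a_t = rng.uniform(0.8, 1.0, size=(M, 1))     # per-row decay
    k_t, v_t = rng.normal(size=N), rng.normal(size=M)
    S_new = ssm_step(S, a_t, k_t, v_t)
    # Every row changes on every step: no part of memory is isolated.
    assert np.all(np.any(S_new != S, axis=1))
    S = S_new
```

No matter which token arrives, all $M$ rows move, so earlier writes are steadily corrupted.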
Sliding window attention (SWA) takes the opposite approach. It maintains a fixed cache and, at each step, $\color{red}{\text{drops}}$ the oldest token to $\color{green}{\text{append}}$ the newest one. Written as a matrix update using a one-hot selector $e_t$:
\[[S^{k}_t, S^v_t] = \underbrace{(\mathbf{1} - e_t)}_{\color{red}{\text{Remove oldest}}}\odot[S^{k}_{t-1}, S^v_{t-1}] + \underbrace{e_t}_{\color{green}{\text{Write new}}} [k_t^\top, v_t^\top]\]SWA is precise about where it writes: it updates exactly one slot ($e_t$). But it doesn't reorganize memory well. Instead of compressing or preserving old content, it removes entries outright to make room for new ones.
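The one-hot update above can be sketched in a few lines of NumPy. The window size and key/value dimensions are arbitrary toy choices:

```python
import numpy as np

W = 4                    # window size = number of slots
dk = dv = 3              # toy key/value dimensions

def swa_step(Sk, Sv, t, k_t, v_t):
    """Sliding-window update: a one-hot selector e_t evicts the oldest
    slot and writes (k_t, v_t) in its place. Precise, but destructive."""
    e = np.zeros((W, 1)); e[t % W] = 1.0          # fixed cyclic rotation
    Sk = (1 - e) * Sk + e * k_t[None, :]
    Sv = (1 - e) * Sv + e * v_t[None, :]
    return Sk, Sv

Sk, Sv = np.zeros((W, dk)), np.zeros((W, dv))
for t in range(6):                  # push 6 tokens through a window of 4
    k_t = np.full(dk, float(t)); v_t = np.full(dv, float(t))
    Sk, Sv = swa_step(Sk, Sv, t, k_t, v_t)
# Tokens 0 and 1 have been evicted; the slots now hold tokens 4, 5, 2, 3.
```

Each write lands in exactly one slot, but the evicted token is gone for good, no matter how important it was.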
SSMs and SWA fail in complementary ways. SSMs update memory too diffusely; SWA updates memory too rigidly.
Bridging the recall gap requires a mechanism that can do both: treat subsets of the hidden state as independent slots, and decide which ones should be updated versus preserved. This dual requirement (routing and decay) is the foundation of our framework.
Bridging the gap between "blind writes" and "forced eviction" requires a unifying framework. Instead of writing everywhere (SSM) or writing to a fixed rotation (SWA), an RSM uses a learned sparse router $r_t \in \mathbb{R}^M$.
This router looks at the current token and determines which slots should change and which should remain untouched:
\[S_t = \underbrace{(\mathbf{1} - r_t) \odot S_{t-1}}_{\text{Persistence}} + \underbrace{r_t \odot \bigl(D_t\,S_{t-1}\,A_t + U_t\bigr)}_{\text{Selective Update}}\]By making $r_t$ sparse, the RSM treats the state as a collection of independent subspaces.
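A minimal NumPy sketch of the routed update, under toy assumptions: identity $D_t$, a scalar-decay $A_t$, a random write $U_t$, and a hard 0/1 router rather than the model's actual learned parameterization:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 8, 4                  # toy state of M slots, N channels each

def rsm_step(S, r_t, D_t, A_t, U_t):
    """Routed update: S_t = (1 - r_t)*S_{t-1} + r_t*(D_t S_{t-1} A_t + U_t).
    Slots with r_t = 0 are copied through untouched (persistence);
    slots with r_t = 1 receive the inner recurrence (selective update)."""
    return (1 - r_t) * S + r_t * (D_t @ S @ A_t + U_t)

S = rng.normal(size=(M, N))
r_t = np.zeros((M, 1)); r_t[[2, 5]] = 1.0       # sparse router: 2 of 8 slots
D_t = np.eye(M); A_t = 0.9 * np.eye(N)          # toy decay operators
U_t = rng.normal(size=(M, N))                   # toy write
S_new = rsm_step(S, r_t, D_t, A_t, U_t)

untouched = [i for i in range(M) if i not in (2, 5)]
# The six unrouted slots are bit-for-bit identical to the previous state.
assert np.allclose(S_new[untouched], S[untouched])
```

The key property is visible in the final assertion: whatever the routed slots do, the unrouted slots pass through exactly, so stored information is not eroded by unrelated tokens.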
By simply changing the behavior of the router $r_t$, this single equation provides a unified view of several prominent sequence model families:
A dense router ($r_t = \mathbf{1}$) recovers an SSM, where every token updates the full state. A one-hot cyclic router ($r_t = e_t$) recovers SWA, where each token updates a single slot in a fixed rotation.
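These two special cases can be checked directly on the persistence/update skeleton $(\mathbf{1}-r_t)\odot S_{t-1} + r_t \odot (\text{update})$, here with an arbitrary stand-in for the inner update term:

```python
import numpy as np

M, N = 4, 3
S = np.arange(M * N, dtype=float).reshape(M, N)
update = np.ones((M, N))     # stand-in for D_t S_{t-1} A_t + U_t

def routed(S, r, upd):
    """The RSM skeleton: persistence plus selective update."""
    return (1 - r) * S + r * upd

# Dense router r = 1: every slot takes the update (SSM-like behavior).
dense = routed(S, np.ones((M, 1)), update)
assert np.allclose(dense, update)

# One-hot cyclic router r = e_t: only slot t % M changes (SWA-like).
t = 1
e = np.zeros((M, 1)); e[t % M] = 1.0
out = routed(S, e, update)
assert np.allclose(out[1], update[1])       # written slot
assert np.allclose(out[[0, 2, 3]], S[[0, 2, 3]])   # preserved slots
```

So both baselines live at the extremes of the same equation, and a learned sparse $r_t$ interpolates between them.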
The question then becomes concrete: can a model learn, purely from data, which slot each token should occupy, and preserve that assignment over time?
Our answer is Raven. By using a learned sparse router, Raven treats its hidden state the way a Mixture-of-Experts layer treats its experts: each token is routed to a small subset of slots, and only those slots are updated while the rest of the state persists.
In Part 2, we'll move from theory to practice: we'll look at the specific architecture of the Raven block, the "counterintuitive" design decisions that made it work, and how this organized memory allows it to recall information $16\times$ beyond its training length.