Raven (Part 1)

Memory as a set of Slots


🐦‍⬛ Ravens are among the most intelligent birds known — famous for caching food across hundreds of locations and retrieving each item with remarkable precision. Just like our new model, which stores information selectively and can retrieve it cleanly when needed!

The Recall Gap

Mamba showed that attention is not strictly necessary for strong language modeling: a simple recurrent model could match, and sometimes outperform, Transformers at a fraction of the cost. But there is no free lunch. Soon after, several works exposed a critical limitation: fixed-size memory models struggle on recall benchmarks. It is tempting to blame their finite capacity (after all, they have only so much room to store information), but the reality is more nuanced.

Consider Needle-In-The-Haystack (NIAH). The model is first given a lookup key, then a long dictionary, and finally asked to retrieve the matching value. This is a form of constant-memory recall: in principle, it requires only $O(1)$ memory, since the model only needs to retain the key and its corresponding value as it scans the sequence (unlike scaling-memory recall tasks like KV-Retrieval, which require $O(T)$ memory because the query comes after the dictionary).
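The $O(1)$ argument can be sketched in a few lines of Python. This is a toy scan over the stream, not the model itself; the function and variable names are made up for illustration:

```python
def niah_constant_memory(key, pairs):
    """Solve a NIAH-style lookup with constant memory: remember the
    lookup key, scan the stream once, and keep only the matching value."""
    found = None  # the only state carried across the scan
    for k, v in pairs:  # stream of (key, value) dictionary entries
        if k == key:
            found = v  # overwrite: memory use is independent of length
    return found

# A long "dictionary" stream; memory use does not grow with its length.
stream = [(f"k{i}", i) for i in range(10_000)]
print(niah_constant_memory("k1234", stream))  # → 1234
```

The point is that a model only needs room for the key and one value, no matter how long the dictionary is; the hard part is preserving that value against thousands of subsequent writes.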

Despite this theoretical $O(1)$ requirement, recurrent models still lag behind Transformers on NIAH. This failure isn’t just a quirk of synthetic benchmarks; it mirrors real-world bottlenecks. In multiple-choice questions, a recurrent model may identify the correct answer early on but fail to preserve that single letter in its state until the output. Similarly, in math and code, recurrent models struggle to track a single variable or a few intermediate values over long distances.

The problem is not a lack of space: the model has enough capacity to store the information. Rather, it lacks the structure needed to organize its state, so new inputs interfere with what is already stored instead of leaving it intact.

From Global States to Partitioned Slots 👕

Can we design a recurrent model with strong constant-memory recall that approaches theoretical limits?

To build intuition, think of the hidden state as a closet. Good organization is not just about how much space you have, but about how that space is structured. When a new item arrives, you usually do two things:

1. Choose where it should go. (Routing)
2. Clear just enough space there to fit it. (Squeezing / Decay)

Memory in recurrent models faces the same dual requirement. A model should decide which part of its state should store the new information, and it should update that part without unnecessarily altering the rest. This motivates the idea of slots: subsets of the hidden state that can be updated independently. With this structure, a new token can modify only the relevant slot(s), while other stored information remains undisturbed.

SSMs write blindly and decay uniformly

State space models (SSMs) and linear attention all share the same fundamental memory operation — a matrix-valued hidden state $S_t \in \mathbb{R}^{M\times d}$ updated by a linear time-dependent recurrence:

\[S_t = \underbrace{S_{t-1}A_t}_{\text{Decay}} + \underbrace{v_t k_t^\top}_{\text{Write}}, \qquad o_t = \underbrace{S_t q_t}_{\text{Read}}\]

SSMs are strong at the reorganization step. The forget gate $A_t$ decays old content to make room for new information. But they do not control where information is written.

In the diagonal case ($A_t=\text{diag}(a_t)$), the update simplifies to $S_t = S_{t-1} \odot a_t + v_t k_t^\top$. Here, the gate $a_t$ decays every row of the state, and because the write $v_t k_t^\top$ is a rank-1 update that lands across all $M$ rows, every new token perturbs everything already stored. There is little isolation between memory contents, so interference accumulates.
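A toy numpy sketch of the diagonal update makes the "write lands everywhere" point concrete. The sizes and decay values here are arbitrary, chosen only to show that a single rank-1 write touches every row of the state:

```python
import numpy as np

# Diagonal SSM update: S_t = S_{t-1} ⊙ a_t + v_t k_t^T  (toy sizes)
M, d = 4, 8
rng = np.random.default_rng(0)
S = np.zeros((M, d))
a = 0.9 * np.ones((M, 1))        # per-row decay, broadcast over columns
k = rng.standard_normal(d)
v = rng.standard_normal(M)

S_new = S * a + np.outer(v, k)   # the write v_t k_t^T is rank-1

# Every row of the state is perturbed by the write: no isolation.
rows_touched = np.count_nonzero(np.abs(S_new - S * a).sum(axis=1))
print(rows_touched)  # → 4 (all M rows)
```

Contrast this with the slot picture: nothing in the update lets the model confine a write to one row while leaving the others byte-for-byte unchanged.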

SWA writes precisely but forcefully evicts

Sliding window attention (SWA) takes the opposite approach. It maintains a fixed cache and, at each step, $\color{red}{\text{drops}}$ the oldest token to $\color{green}{\text{append}}$ the newest one. Written as a matrix update using a one-hot selector $e_t$:

\[[S^{k}_t, S^v_t] = \underbrace{(\mathbf{1} - e_t)}_{\color{red}{\text{Remove oldest}}}\odot[S^{k}_{t-1}, S^v_{t-1}] + \underbrace{e_t}_{\color{green}{\text{Write new}}} [k_t^\top, v_t^\top]\]

SWA is precise about where it writes: it updates exactly one slot ($e_t$). But it doesn’t reorganize memory well. Instead of compressing or preserving old content, it removes entries outright to make room for new ones.
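The forced eviction is easy to see in a toy numpy simulation of the update above. The window size and token encodings are made up for illustration:

```python
import numpy as np

# Sliding-window attention as a slot update: a one-hot selector e_t
# overwrites exactly one (k, v) slot in a fixed cyclic rotation.
M, d = 4, 3
Sk = np.zeros((M, d))  # cached keys
Sv = np.zeros((M, d))  # cached values

for t in range(6):  # stream 6 tokens through a window of size 4
    e = np.zeros((M, 1))
    e[t % M] = 1.0                  # cyclic one-hot selector
    k_t = np.full(d, float(t))      # toy key for token t
    v_t = np.full(d, float(t))      # toy value for token t
    Sk = (1 - e) * Sk + e * k_t     # precise write into one slot...
    Sv = (1 - e) * Sv + e * v_t     # ...but the old entry is gone

print(Sk[:, 0])  # → [4. 5. 2. 3.]  tokens 0 and 1 were evicted
```

Note that tokens 0 and 1 are overwritten regardless of whether they mattered; the rotation is fixed by position, not content.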

The Dual Requirement

SSMs and SWA fail in complementary ways. SSMs update memory too diffusely; SWA updates memory too rigidly.

Bridging the recall gap requires a mechanism that can do both: treat subsets of the hidden state as independent slots, and decide which ones should be updated versus preserved. This dual requirement—routing and decay—is the foundation of our framework.

Routing Slot Memories (RSM)

An RSM bridges the gap between “blind writes” and “forced eviction” with a unifying framework. Instead of writing everywhere (SSM) or writing to a fixed rotation (SWA), it uses a learned sparse router $r_t \in \mathbb{R}^M$.

This router looks at the current token and determines which slots should change and which should remain untouched:

\[S_t = \underbrace{(\mathbf{1} - r_t) \odot S_{t-1}}_{\text{Persistence}} + \underbrace{r_t \odot \bigl(D_t\,S_{t-1}\,A_t + U_t\bigr)}_{\text{Selective Update}}\]

By making $r_t$ sparse, the RSM treats the state as a collection of independent subspaces. And by simply changing the behavior of the router, this single equation provides a unified view of several prominent sequence-model families:

A dense router ($r_t = \mathbf{1}$) recovers an SSM, where every token updates the full state. A one-hot cyclic router ($r_t = e_t$) recovers SWA, where each token updates a single slot in a fixed rotation.
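These two special cases can be checked with a minimal numpy sketch of the unified update. For simplicity the inner decay ($D_t$, $A_t$) is collapsed into a single scalar, and the routers are hard-coded rather than learned; all names and sizes are illustrative:

```python
import numpy as np

def rsm_step(S, r, a, U):
    # (1 - r_t) ⊙ S_{t-1}  +  r_t ⊙ (decayed state + write)
    # D_t and A_t are collapsed into one decay scalar `a` for brevity.
    return (1 - r) * S + r * (a * S + U)

M, d = 4, 3
S = np.zeros((M, d))
U = np.ones((M, d))  # toy write

dense = np.ones((M, 1))          # r_t = 1   -> SSM: every slot updated
one_hot = np.zeros((M, 1))
one_hot[2] = 1.0                 # r_t = e_t -> SWA-style: one slot updated

print(rsm_step(S, dense, 0.9, U)[:, 0])    # → [1. 1. 1. 1.]
print(rsm_step(S, one_hot, 0.9, U)[:, 0])  # → [0. 0. 1. 0.]
```

A learned sparse router sits between these extremes: a few entries of $r_t$ are active per token, and which ones depends on the token's content rather than its position.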

The question then becomes concrete: can a model learn, purely from data, which slot each token should occupy, and preserve that assignment over time?

Our answer is Raven. By using a learned sparse router, Raven treats its hidden state like a Mixture-of-Experts for memory. It can place critical information into dedicated slots and avoid corrupting those slots with irrelevant filler tokens.

In Part 2, we’ll move from theory to practice: we’ll look at the specific architecture of the Raven block, the “counterintuitive” design decisions that made it work, and how this organized memory allows it to recall information $16\times$ beyond its training length.