| Model | Router (\(\mathbf{r}_t\)) | Content (\(\mathbf{u}_t\)) | Decay (\(\mathbf{A}_t\)) | Readout (\(\mathbf{o}_t\)) |
| --- | --- | --- | --- | --- |
| LinAtt | \(\mathbf{1}_M\) | \(\mathbf{v}_t\mathbf{k}_t^\top\) | \(\mathbf{I}\) | \(\mathbf{S}_t\mathbf{q}_t\) |
| RetNet | \(\mathbf{1}_M\) | \(\mathbf{v}_t\mathbf{k}_t^\top\) | \(\gamma\) | \(\mathbf{S}_t\mathbf{q}_t\) |
| GLA | \(\mathbf{1}_M\) | \(\mathbf{v}_t\mathbf{k}_t^\top\) | \(\text{diag}(\sigma(\mathbf{W}\mathbf{x}_t))^{1/\tau}\) | \(\mathbf{S}_t\mathbf{q}_t\) |
| Mamba-2 | \(\mathbf{1}_M\) | \(\mathbf{v}_t\mathbf{k}_t^\top\) | \(a_t\) | \(\mathbf{S}_t\mathbf{q}_t\) |
| GDN | \(\mathbf{1}_M\) | \(\mathbf{v}_t\mathbf{k}_t^\top\) | \(a_t(\mathbf{I} - \mathbf{k}_t\mathbf{k}_t^\top)\) | \(\mathbf{S}_t\mathbf{q}_t\) |
| Raven | \({\mathbf{g}_t} / {\mathbf{1}^\top \mathbf{g}_t}\) | \([\mathbf{k}_t \ \mathbf{v}_t]^\top\) | \(\mathbf{I}\) | \((\mathbf{S}^v_t)^\top f(\mathbf{S}^k_t\mathbf{q}_t)\) |
| GSA | \(\mathbf{1}_M - \sigma(\mathbf{W}\mathbf{x}_t)^{1/\tau}\) | \([\mathbf{k}_t \ \mathbf{v}_t]^\top\) | \(\mathbf{I}\) | \((\mathbf{S}^v_t)^\top f(\mathbf{S}^k_t\mathbf{q}_t)\) |
| ABC | \(\text{softmax}(\mathbf{W}\mathbf{x}_t)\) | \([\mathbf{k}_t \ \mathbf{v}_t]^\top\) | \(\mathbf{I}\) | \((\mathbf{S}^v_t)^\top f(\mathbf{S}^k_t\mathbf{q}_t)\) |
| SWA | \(\mathbf{e}_t\) | \([\mathbf{k}_t \ \mathbf{v}_t]^\top\) | \(\mathbf{I}\) | \((\mathbf{S}^v_t)^\top f(\mathbf{S}^k_t\mathbf{q}_t)\) |
Table 1: Routing Slot Memories. Design of the router and decay components across architectures. The bar on the right, together with the row colors, places each model on a spectrum of router sparsity, with SWA and SSMs marking the two extremes. For SSMs, the content matrix (\(\mathbf{u}_t\)) is low-rank (the rank-one outer product \(\mathbf{v}_t\mathbf{k}_t^\top\)), whereas SWA-like models use a sparse stacking of \(\mathbf{k}_t\) and \(\mathbf{v}_t\). \(f(\cdot)\) denotes the softmax. Dual-state models (those highlighted in red, and Raven) apply slot-wise decay via \(\mathbf{D}_t\); their channel-wise decay \(\mathbf{A}_t\) therefore equals \(\mathbf{I}\).
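
To make the two readout families in Table 1 concrete, the following NumPy sketch contrasts a single-state recurrence (the LinAtt, RetNet, and Mamba-2 rows) with a dual-state slot memory (an ABC-style row). It is an illustration under stated assumptions, not the reference implementation of any of these models: the update rules \(\mathbf{S}_t = a_t\mathbf{S}_{t-1} + \mathbf{v}_t\mathbf{k}_t^\top\) and \(\mathbf{S}^k_t = \mathbf{S}^k_{t-1} + \mathbf{r}_t\mathbf{k}_t^\top\), \(\mathbf{S}^v_t = \mathbf{S}^v_{t-1} + \mathbf{r}_t\mathbf{v}_t^\top\) are assumed here, the slot-wise decay \(\mathbf{D}_t\) is taken to be the identity, and the function names `single_state_scan` and `dual_state_scan` are hypothetical.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D vector."""
    z = np.exp(x - np.max(x))
    return z / z.sum()


def single_state_scan(q, k, v, decay):
    """Single-state rows with a scalar decay (assumed update rule).

    S_t = a_t * S_{t-1} + v_t k_t^T,  o_t = S_t q_t.
    decay[t] = 1 gives LinAtt, a constant gamma gives RetNet,
    and a per-step scalar a_t gives a Mamba-2-style decay.
    Shapes: q, k are (T, d_k); v is (T, d_v); decay is (T,).
    """
    T, d_k = k.shape
    d_v = v.shape[1]
    S = np.zeros((d_v, d_k))
    out = np.zeros((T, d_v))
    for t in range(T):
        S = decay[t] * S + np.outer(v[t], k[t])  # rank-one content update
        out[t] = S @ q[t]                        # linear readout
    return out


def dual_state_scan(q, k, v, router):
    """Dual-state slot memory (assumed update rule, no slot-wise decay).

    S^k_t = S^k_{t-1} + r_t k_t^T,  S^v_t = S^v_{t-1} + r_t v_t^T,
    o_t = (S^v_t)^T softmax(S^k_t q_t).
    Shapes: router is (T, M), one routing vector r_t per step.
    """
    T, d_k = k.shape
    d_v = v.shape[1]
    M = router.shape[1]
    Sk = np.zeros((M, d_k))
    Sv = np.zeros((M, d_v))
    out = np.zeros((T, d_v))
    for t in range(T):
        Sk = Sk + np.outer(router[t], k[t])  # route the key into M slots
        Sv = Sv + np.outer(router[t], v[t])  # route the value into M slots
        out[t] = Sv.T @ softmax(Sk @ q[t])   # softmax readout over slots
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, d_k, d_v, M = 6, 4, 3, 2
    q, k, v = (rng.normal(size=(T, d)) for d in (d_k, d_k, d_v))
    # ABC-style router: softmax over M slot logits (random stand-in for W x_t)
    router = np.stack([softmax(row) for row in rng.normal(size=(T, M))])
    print(single_state_scan(q, k, v, np.full(T, 0.9)).shape)
    print(dual_state_scan(q, k, v, router).shape)
```

Passing a one-hot `router` in place of the softmax gives the sparse extreme of the spectrum, where each step writes to a single slot; reproducing SWA itself would additionally require the slot-wise decay \(\mathbf{D}_t\), which this sketch omits.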