Routing Slot Memories Table

MODEL	ROUTER (\(\mathbf{r}_t\))	CONTENT (\(\mathbf{u}_t\))	DECAY (\(\mathbf{A}_t\))	READOUT (\(\mathbf{o}_t\))
LinAtt	\(\mathbf{1}_M\)	\(\mathbf{v}_t\mathbf{k}_t^\top\)	\(\mathbf{I}\)	\(\mathbf{S}_t\mathbf{q}_t\)
RetNet	\(\mathbf{1}_M\)	\(\mathbf{v}_t\mathbf{k}_t^\top\)	\(\gamma\)	\(\mathbf{S}_t\mathbf{q}_t\)
GLA	\(\mathbf{1}_M\)	\(\mathbf{v}_t\mathbf{k}_t^\top\)	\(\text{diag}(\sigma(\mathbf{W}\mathbf{x}_t))^{1/\tau}\)	\(\mathbf{S}_t\mathbf{q}_t\)
Mamba-2	\(\mathbf{1}_M\)	\(\mathbf{v}_t\mathbf{k}_t^\top\)	\(a_t\)	\(\mathbf{S}_t\mathbf{q}_t\)
GDN	\(\mathbf{1}_M\)	\(\mathbf{v}_t\mathbf{k}_t^\top\)	\(a_t(\mathbf{I} - \mathbf{k}_t\mathbf{k}_t^\top)\)	\(\mathbf{S}_t\mathbf{q}_t\)
Raven	\({\mathbf{g}_t} / {\mathbf{1}^\top \mathbf{g}_t}\)	\([\mathbf{k}_t \ \mathbf{v}_t]^\top\)	\(\mathbf{I}\)	\((\mathbf{S}^v_t)^\top f(\mathbf{S}^k_t\mathbf{q}_t)\)
GSA	\(\mathbf{1}_M - \sigma(\mathbf{W}\mathbf{x}_t)^{1/\tau}\)	\([\mathbf{k}_t \ \mathbf{v}_t]^\top\)	\(\mathbf{I}\)	\((\mathbf{S}^v_t)^\top f(\mathbf{S}^k_t\mathbf{q}_t)\)
ABC	\(\text{softmax}(\mathbf{W}\mathbf{x}_t)\)	\([\mathbf{k}_t \ \mathbf{v}_t]^\top\)	\(\mathbf{I}\)	\((\mathbf{S}^v_t)^\top f(\mathbf{S}^k_t\mathbf{q}_t)\)
SWA	\(\mathbf{e}_t\)	\([\mathbf{k}_t \ \mathbf{v}_t]^\top\)	\(\mathbf{I}\)	\((\mathbf{S}^v_t)^\top f(\mathbf{S}^k_t\mathbf{q}_t)\)

Table 1: Routing Slot Memories. Design of the Router and Decay components across multiple architectures. The bar on the right, together with the row colors, shows where each model falls on a spectrum of router sparsity, with SWA and SSMs marking the two extremes. For SSMs, the content matrix (\(\mathbf{u}_t\)) is low-rank (the outer product \(\mathbf{v}_t\mathbf{k}_t^\top\)), whereas SWA-like models use a sparse stacking of \(\mathbf{k}_t\) and \(\mathbf{v}_t\). \(f(\cdot)\) denotes the softmax. Dual-state models (those highlighted in red, and Raven) apply slot-wise decay via \(\mathbf{D}_t\); therefore, their channel-wise decay \(\mathbf{A}_t\) equals \(\mathbf{I}\).
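To make the Router and Readout columns concrete, the following is a minimal pure-Python sketch of the router vectors \(\mathbf{r}_t\) and the two readout forms listed in Table 1. It is illustrative only and is not the implementation of any of the listed models. All function names are hypothetical; `logits` stands in for \(\mathbf{W}\mathbf{x}_t\), `gates` stands in for \(\mathbf{g}_t\) (assumed nonnegative), the SWA slot index is assumed to wrap around as \(t \bmod M\), and the state shapes (\(\mathbf{S}^k_t\) as \(M \times d_k\), \(\mathbf{S}^v_t\) as \(M \times d_v\)) are assumptions chosen so the readout formulas type-check.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(zs):
    # Subtract the maximum for numerical stability.
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


def router_dense(num_slots):
    """LinAtt, RetNet, GLA, Mamba-2, GDN: r_t = 1_M (every slot is written)."""
    return [1.0] * num_slots


def router_abc(logits):
    """ABC: r_t = softmax(W x_t); `logits` stands in for W x_t."""
    return softmax(logits)


def router_gsa(logits, tau):
    """GSA: r_t = 1_M - sigmoid(W x_t)^(1/tau)."""
    return [1.0 - sigmoid(z) ** (1.0 / tau) for z in logits]


def router_raven(gates):
    """Raven: r_t = g_t / (1^T g_t). Assumes nonnegative gates, not all zero."""
    total = sum(gates)
    return [g / total for g in gates]


def router_swa(t, num_slots):
    """SWA: r_t = e_t (one-hot). Wrapping with t mod M is an assumption."""
    return [1.0 if i == t % num_slots else 0.0 for i in range(num_slots)]


def readout_single_state(S, q):
    """o_t = S_t q_t, with S given as a list of rows."""
    return [sum(s * qi for s, qi in zip(row, q)) for row in S]


def readout_dual_state(S_k, S_v, q):
    """o_t = (S^v_t)^T f(S^k_t q_t), with f the softmax over the M slots."""
    weights = softmax(readout_single_state(S_k, q))
    d_v = len(S_v[0])
    return [sum(w * row[j] for w, row in zip(weights, S_v)) for j in range(d_v)]
```

The routers run from fully dense (`router_dense`, all slots updated at every step) to fully sparse (`router_swa`, exactly one slot updated), which is the sparsity spectrum the caption describes; `router_abc`, `router_gsa`, and `router_raven` sit in between as soft assignments over slots.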