LION 🦁 Part IV - Results

Comprehensive results on Vision, MLM and more LION variants

[Paper] [Code]

  1. Part I - Full Linear Attention
  2. Part II - Bi-directional RNN
  3. Part III - Chunkwise Parallel Form of LION
  4. Part IV - Results

In the final part of our LION series, we will present and discuss a selection of experimental results across various domains, including vision tasks, masked language modeling (MLM), and different LION architectures. These results not only highlight LION’s versatility and efficiency across diverse applications but also serve as a preview of the comprehensive findings detailed in the full paper.

Image Classification Performance Overview

Model Comparisons

We evaluated LION’s performance, efficiency, and training times against state-of-the-art SSMs and Transformers for image classification. The results demonstrate that LION achieves competitive performance while offering significant advantages in training speed and efficiency.

| Model | #Param | ImageNet Top-1 Acc. | Train. time |
|---|---|---|---|
| ViT | 86M | $77.9$ | $\times 1$ |
| DeiT | 86M | $\underline{81.8}$ | $\times 1$ |
| Hydra | 104M | $81.0$ | $\times 2.51$ |
| Vim | 98M | $\mathbf{81.9}$ | $\times 10.86$ |
| LION-🔥 | 86M | $74.7$ | $\mathbf{\times 0.73}$ |
| LION-D | 86M | $77.8$ | $\times \underline{1.39}$ |
| LION-D$^{\natural}$ | 86M | $80.2$ | $\times 1.48$ |
| LION-S | 86M | $76.3$ | $\times 1.46$ |
| LION-S$^{\natural}$ | 86M | $79.9$ | $\times 1.68$ |

Model performance comparison on ImageNet classification, showing parameter count, top-1 accuracy, and relative training time.

As shown in the table above, LION models achieve competitive performance with vision-specific SSMs like Vim, while being significantly faster during training. LION-D$^{\natural}$ approaches the accuracy of Hydra and Vim while training roughly 7x faster than Vim. Notably, LION-🔥 demonstrates the highest training speed across all models, showing that training with Full Linear Attention is significantly faster than chunkwise parallel training (used in Hydra) and considerably faster than the scan algorithm, even with optimized GPU kernels (as used in Vim). LION-S$^{\natural}$ and LION-D$^{\natural}$ modify the order of patches in an image to better capture the locality inherent in spatial patterns. By rearranging the patch sequence, these models enhance their understanding of local structures while still leveraging the efficiency of Linear Attention mechanisms, similar to xLSTM.
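The exact reordering used by the $^{\natural}$ variants is described in the paper; purely as an illustration of the general idea (the scheme below is hypothetical and not necessarily the one LION uses), the snippet reorders a row-major flattened patch grid into a column-major sequence, so that a traversal follows vertical neighborhoods, and shows how to invert the permutation:

```python
import numpy as np

# Hypothetical example of patch re-ordering (not necessarily LION's scheme):
# ViT flattens a 14 x 14 patch grid row by row; a column-major permutation
# makes the sequence follow vertical neighborhoods instead.
H = W = 14                                     # patch grid for 224px images with 16px patches
row_major = np.arange(H * W).reshape(H, W)     # default row-major patch indices
col_major = row_major.T.flatten()              # alternative, column-major order

tokens = np.random.randn(H * W, 192)           # toy patch embeddings
reordered = tokens[col_major]                  # sequence fed in the new order
inverse = np.argsort(col_major)                # permutation that restores row-major order
assert np.allclose(tokens, reordered[inverse])
```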

Memory Efficiency

The LION family demonstrates excellent memory efficiency across both vision and language tasks. The figure below shows inference memory usage with a batch size of 64 across different image resolutions: LION models (in their RNN form) maintain reasonable memory consumption even at high resolutions of up to 2496 pixels, while adding minimal training overhead in BERT-style language modeling scenarios. In contrast, baseline models such as ViT and DeiT run out of memory (OOM) at much lower resolutions.

Memory usage during inference across different architectures with batch size 64. LION models (RNN form) maintain reasonable memory consumption at high resolutions while other models run out of memory.
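To get a feel for why the quadratic form runs out of memory, here is a back-of-the-envelope sketch (illustrative only, not the measured numbers above; it assumes 16x16 patches, fp32 activations, a single head with dimension 64, and a single image, ignoring batching and all other buffers): the $L \times L$ attention map grows quadratically with resolution, while the bidirectional RNN form only stores two fixed-size $d \times d$ states.

```python
# Back-of-the-envelope memory estimate (assumptions: 16x16 patches, fp32,
# one head of dimension 64, one image; heads, batching, and other
# activations are ignored, so these are not measured numbers).
PATCH, BYTES, HEAD_DIM = 16, 4, 64

for res in (224, 768, 1536, 2496):
    n_tokens = (res // PATCH) ** 2              # sequence length L
    attn_bytes = n_tokens ** 2 * BYTES          # full L x L attention map
    state_bytes = 2 * HEAD_DIM ** 2 * BYTES     # forward + backward d x d states
    print(f"{res:4d}px  L={n_tokens:6d}  "
          f"attention ~ {attn_bytes / 2**20:7.1f} MiB  "
          f"RNN states ~ {state_bytes / 2**10:4.1f} KiB")
```

At 2496 pixels the attention map alone is on the order of gigabytes per head, while the RNN states stay in the kilobyte range regardless of resolution.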

Training Time Analysis

The LION family demonstrates remarkable training efficiency across both vision and language tasks. As shown in the table below, LION variants add minimal overhead relative to Transformers and train substantially faster than SSM baselines such as Hydra and Vim.

| Task | LION-🔥 | LION-D | LION-S | Hydra | Vim |
|---|---|---|---|---|---|
| Vision | $\times 0.73$ | $\times 1.39$ | $\times 1.46$ | $\times 2.51$ | $\times 10.86$ |
| MLM | $\times 0.95$ | $\times 1.10$ | $\times 1.32$ | $\times 3.13$ | ✗ |

Training times (relative to Transformer) ↓

For vision tasks, LION-🔥 achieves remarkable speed, training 27% faster than standard vision Transformers. Even the more complex LION variants maintain competitive training times, with LION-D and LION-S training only ~1.4x slower than Transformers. This is significantly better than competing approaches like Hydra (2.51x slower) and Vim (10.86x slower).

In MLM tasks, the efficiency gains are even more pronounced. LION-🔥 nearly matches Transformer training speed at just 0.95x, while LION-D adds only 10% overhead. Even LION-S remains efficient at 1.32x. All LION variants significantly outperform Hydra’s 3.13x slowdown, while Vim is not applicable to MLM tasks (marked as ✗).

MLM Results

For masked language modeling (MLM) tasks, we evaluated LION models against BERT and Hydra on both MLM pretraining and GLUE benchmark finetuning. The results show that LION variants achieve competitive performance while maintaining good training efficiency.

| Model | MLM Acc. | GLUE | Train. time |
|---|---|---|---|
| BERT | $\underline{69.88}$ | $\mathbf{82.95}$ | $\times 1$ |
| Hydra | $\mathbf{71.18}$ | $\underline{81.77}$ | $\times 3.13$ |
| LION-🔥 | $67.11$ | $80.76$ | $\times \mathbf{0.95}$ |
| LION-D | $68.64$ | $81.34$ | $\times \underline{1.10}$ |
| LION-S | $69.16$ | $81.58$ | $\times 1.32$ |

C4 MLM and GLUE results at the LARGE scale (334M). For each dataset, the best and second-best results are highlighted in bold and underline, respectively.

LION Architecture Variants and Trade-offs

Let’s explore how the different computational forms of LION handle the trade-off between memory usage and inference speed. We will look at three key approaches (a short numerical sketch follows the list below):

  1. Full Linear Attention - The standard approach using the Full Attention matrix.
  2. Bidirectional RNN - Our memory-efficient RNN formulation.
  3. LION Chunk - A balanced approach using chunked computation.
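Before turning to the plots, here is a minimal NumPy sketch of the first two forms (a simplified setting: plain bidirectional linear attention without the decay, selectivity, or normalization discussed in Parts I-III, i.e. closest to LION-🔥). It verifies that materializing the full $L \times L$ attention matrix and running two $d \times d$ recurrences produce the same output:

```python
import numpy as np

rng = np.random.default_rng(0)
L, d = 16, 8                                   # sequence length, head dimension
Q, K, V = (rng.standard_normal((L, d)) for _ in range(3))

# 1) Full Linear Attention: materialize the L x L attention matrix.
Y_full = (Q @ K.T) @ V

# 2) Bidirectional RNN form: a forward and a backward recurrence, each
#    carrying a d x d state; the diagonal term (q_i . k_i) v_i is counted
#    by both directions, so it is subtracted once.
S_fwd, S_bwd = np.zeros((d, d)), np.zeros((d, d))
out_fwd, out_bwd = np.zeros((L, d)), np.zeros((L, d))
for i in range(L):
    S_fwd += np.outer(K[i], V[i])
    out_fwd[i] = Q[i] @ S_fwd
for i in reversed(range(L)):
    S_bwd += np.outer(K[i], V[i])
    out_bwd[i] = Q[i] @ S_bwd
diag = (Q * K).sum(axis=1, keepdims=True) * V
Y_rnn = out_fwd + out_bwd - diag

assert np.allclose(Y_full, Y_rnn)
```

The RNN form never stores more than two $d \times d$ states, which is what drives its memory advantage in the plots below.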

Memory vs Speed Trade-offs

The first plot below shows how these approaches compare in terms of memory efficiency and inference speed for LION-D. The RNN approach proves to be the most memory-efficient, while Full Attention uses the most memory. LION Chunk provides a nice middle ground - it uses less memory than Full Attention while actually achieving faster inference speeds than both alternatives. This makes it particularly attractive when you need to balance performance with resource constraints.

Analysis of how chunk size affects model performance across different LION-D variants.
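To make the chunked form concrete, the sketch below extends the same simplified setting (unmasked, unnormalized linear attention, so without the decay mask that LION-D adds): each chunk computes a small $C \times C$ attention block locally, and everything outside the chunk enters only through $d \times d$ summary states, which is why its memory and speed sit between the two extremes.

```python
import numpy as np

rng = np.random.default_rng(0)
L, d, C = 16, 8, 4                             # sequence length, head dim, chunk size
Q, K, V = (rng.standard_normal((L, d)) for _ in range(3))

Y_full = (Q @ K.T) @ V                         # reference: Full Linear Attention

# One d x d key-value summary per chunk.
chunk_states = [K[s:s + C].T @ V[s:s + C] for s in range(0, L, C)]

Y_chunk = np.zeros((L, d))
for c, s in enumerate(range(0, L, C)):
    Qc, Kc, Vc = Q[s:s + C], K[s:s + C], V[s:s + C]
    S_before = sum(chunk_states[:c], np.zeros((d, d)))     # chunks to the left
    S_after = sum(chunk_states[c + 1:], np.zeros((d, d)))  # chunks to the right
    # Intra-chunk: C x C attention block; inter-chunk: d x d states only.
    Y_chunk[s:s + C] = (Qc @ Kc.T) @ Vc + Qc @ (S_before + S_after)

assert np.allclose(Y_full, Y_chunk)
```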

For LION-🔥, we see a similar pattern, but the advantage of the chunked approach is even more pronounced.

Evaluation of linear chunking strategies and their impact on model efficiency of LION-🔥.

Lastly, for LION-S, we see that the chunked approach is only faster at lower resolutions - at higher resolutions, the overhead from mask calculations starts to slow it down.

Performance comparison of selective chunking approaches across different sequence lengths with LION-S.

Last Points

We encourage readers of this blog post to read the full paper for more details on the LION framework and experimental setups. The implementation details are available in the code repository.

If you use this work, please consider citing the paper:

@article{afzal2025linear,
  title={Linear Attention for Efficient Bidirectional Sequence Modeling},
  author={Afzal, Arshia and Abad Rocamora, Elias and Candogan, Leyla Naz and Puigdemont, Pol and Tonin, Francesco and Wu, Yongtao and Shoaran, Mahsa and Cevher, Volkan},
  journal={arXiv preprint arXiv:2502.16249},
  year={2025},
  url={https://arxiv.org/abs/2502.16249},
  doi={10.48550/arXiv.2502.16249}
}