FlexRank: Nested Low-Rank Knowledge Decomposition for Adaptive Model Deployment

📌 Poster Session
🗓️ Wed, Jul 8, 2026 • 🕝 2:30–4:15 PM KST • 📍 Hall A #2208
🎤 AdaptFM Workshop
🗓️ Sat, Jul 11, 2026 • 🕟 2:30–3:15 PM KST • 📍 ASEM Ballroom 202
🎤 ColorAI Workshop
🗓️ Sat, Jul 11, 2026 • 🕗 3:15–4:30 PM KST (in the afternoon session) • 📍 Room 327

Modern LLMs and Vision Transformers are usually deployed as fixed computational monoliths: one model, one cost, one accuracy point. FlexRank changes this by turning a pretrained model into a family of nested low-rank submodels that share the same weights.

🚀 Train once, deploy everywhere: choose the rank budget at inference time.
🧩 One shared model, many sizes: smaller submodels are nested inside larger ones.
🎯 Budget-aware rank allocation: FlexRank spends parameters where they matter most.
⚡ Real inference savings: Gauge-Aligned Reparametrization makes low-rank deployment practical.
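
To make "nested low-rank submodels that share the same weights" concrete, here is a minimal sketch for a single linear layer. It assumes the layer has already been factorized as W ≈ U·V with components ordered by importance; the matrices `U`, `V` and the helper names are hypothetical toy values, not FlexRank's code, and Gauge-Aligned Reparametrization is not modeled here.

```python
def low_rank_forward(U, V, x, rank):
    """Apply the first `rank` components of the factorization W ~= U @ V to x."""
    # Project the input onto the first `rank` rows of V.
    z = [sum(v * xj for v, xj in zip(V[k], x)) for k in range(rank)]
    # Expand back to the output space with the first `rank` columns of U.
    return [sum(row[k] * z[k] for k in range(rank)) for row in U]


def low_rank_params(m, n, rank):
    """Parameters stored by a rank-`rank` factorization of an m x n layer."""
    return rank * (m + n)


# Hypothetical toy factors for a 4 x 5 layer with 3 ordered components.
U = [[1.0, 0.5, 0.0],
     [0.0, 1.0, 0.5],
     [2.0, 0.0, 1.0],
     [0.0, 0.0, 1.0]]
V = [[1.0, 0.0, 1.0, 0.0, 0.0],
     [0.0, 2.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0, 0.0, 3.0]]
x = [1.0, 2.0, 3.0, 4.0, 5.0]

y1 = low_rank_forward(U, V, x, rank=1)
y2 = low_rank_forward(U, V, x, rank=2)

# The rank-2 output is the rank-1 output plus one extra component,
# so the smaller submodel is contained in the larger one.
z_second = sum(v * xj for v, xj in zip(V[1], x))
refined = [a + row[1] * z_second for a, row in zip(y1, U)]

print(y1, y2, refined == y2)
print(low_rank_params(4, 5, 1), low_rank_params(4, 5, 2))
```

Choosing `rank` at call time is what lets one set of weights serve several budgets: no component is stored twice, and a rank-r factorization of an m × n layer stores r(m + n) parameters.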

FlexRank is built around a simple deployment question: if we can only afford part of the model, which part should we keep? A uniform rank cut is too crude, because different layers and modules contribute differently to the final prediction. FlexRank instead learns an ordered decomposition of the pretrained model and uses it to build a budget-aware hierarchy.

🧱 Step 1 - Decompose: each linear layer is factorized into low-rank components ordered by importance.
🧭 Step 2 - Search: dynamic programming decides how many components to keep in each layer for each budget.
🔁 Step 3 - Consolidate: sampled submodels are distilled from the original model so all budgets work well together.
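
The search in Step 2 can be pictured as a multiple-choice knapsack. The sketch below is an illustration under stated assumptions, not the paper's exact objective: it assumes each layer has a per-rank proxy error (`errors`, hypothetical values), that errors add across layers, and that every kept component in a layer has a fixed integer cost (`unit_costs`).

```python
def allocate_ranks(errors, unit_costs, budget):
    """Pick how many components to keep per layer under a total cost budget.

    errors[l][r]  : assumed proxy error of layer l when keeping r components.
    unit_costs[l] : assumed integer cost of one component in layer l.
    Returns (total_error, ranks) minimizing the summed error with cost <= budget.
    """
    # best[b] = (error, ranks) for the layers seen so far with cost at most b.
    best = [(0.0, [])] * (budget + 1)
    for layer_err, cost in zip(errors, unit_costs):
        new_best = [(float("inf"), None)] * (budget + 1)
        for b in range(budget + 1):
            for r, err in enumerate(layer_err):
                spent = r * cost
                if spent > b:
                    break  # higher ranks only cost more
                prev_err, prev_ranks = best[b - spent]
                if prev_err + err < new_best[b][0]:
                    new_best[b] = (prev_err + err, prev_ranks + [r])
        best = new_best
    return best[budget]


# Hypothetical two-layer example: the second layer is costlier per component.
errors = [[10.0, 4.0, 1.0, 0.0],
          [6.0, 2.0, 1.0, 0.5]]
unit_costs = [2, 3]

total_error, ranks = allocate_ranks(errors, unit_costs, budget=8)
print(total_error, ranks)
```

With these toy numbers the search keeps two components in the first layer and one in the second, rather than cutting both layers by the same fraction. This sketch solves each budget independently and does not by itself force the allocations for different budgets to be nested.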

The result is not a collection of separately trained compressed models. It is one elastic model whose smaller configurations are nested inside the larger ones, making the accuracy-cost curve smooth and easy to deploy.

The main idea is simple: elastic models should not be a bag of unrelated submodels. They should form a clean hierarchy where larger models refine the components reused by smaller ones.

❌ Post-training selection: good full model, weak smaller models.
❌ All-submodel training: too much interference between rank choices.
✅ Nested submodel training: compatible budgets, shared knowledge, Pareto-efficient behavior.
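
One way to read "a clean hierarchy" operationally: as the budget grows, no layer ever loses components. The sketch below checks that property on a table of rank profiles and picks the largest profile that fits a device budget. The table layout, the parameter counts, and the function names are assumptions made for illustration.

```python
from bisect import bisect_right


def is_nested(profiles):
    """True if every layer's rank is non-decreasing as the budget grows."""
    ranks = [layer_ranks for _, layer_ranks in profiles]
    return all(
        small <= large
        for smaller, larger in zip(ranks, ranks[1:])
        for small, large in zip(smaller, larger)
    )


def select_profile(profiles, device_budget):
    """Return the largest (param_count, ranks) entry that fits the budget.

    `profiles` must be sorted by param_count in ascending order.
    """
    counts = [count for count, _ in profiles]
    idx = bisect_right(counts, device_budget)
    if idx == 0:
        raise ValueError("no submodel fits the requested budget")
    return profiles[idx - 1]


# Hypothetical rank profiles for a three-layer model.
profiles = [
    (1000, [2, 1, 1]),
    (2000, [4, 1, 3]),
    (4000, [8, 2, 6]),
]

print(is_nested(profiles))
print(select_profile(profiles, device_budget=2500))
```

A bag of unrelated submodels would fail the `is_nested` check whenever a larger budget drops a component that a smaller budget relies on.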

FlexRank delivers smooth accuracy-cost trade-offs across LLMs and Vision Transformers. It consistently improves over SVD, DataSVD, and ACIP-style low-rank elastic baselines.

🦙 Llama models: graceful degradation over many parameter budgets.
🖼️ DINOv3 ViTs: strong ImageNet1K accuracy even after large parameter reductions.
🏁 Beyond low-rank baselines: competitive with structured pruning and depth-elastic methods.

The ablations show that FlexRank is more than SVD plus training: both the rank allocation and the nested training procedure are doing important work.

🔍 Rank profiles are non-uniform. In the heatmaps below, each column corresponds to a GPT-2 module and each row to a target model size. If uniform compression were enough, the heatmaps would look almost flat. Instead, FlexRank preserves more capacity in specific modules, showing that the dynamic-programming search identifies where rank is most valuable.

📈 Initialization alone does not solve elasticity. In Figure 7(a), the DataSVD curves with 256 and 1024 calibration samples almost overlap, showing that a small calibration set is already enough to estimate the layer-wise decomposition. However, the loss remains far from the original model at smaller parameter counts, so better initialization alone is not enough.

🧠 Joint nested training is the key consolidation step. Figure 7(b) shows that independently adapting each layer is still much weaker than training the selected submodels end-to-end: the model needs to repair cross-layer interactions, not only local reconstruction errors. Figure 8 then isolates the role of budget sampling: each single-budget model is strong near the budget it was trained for, but fails to trace the full Pareto curve. FlexRank stays close to the best curve because it distills many nested budgets into the same shared weights.
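
The budget-sampling idea can be sketched as a loss that averages a distillation term over randomly drawn budgets. This is a simplified illustration: the use of KL divergence, the uniform sampling, and the toy `student_logits_fn` are assumptions, and a real setup would backpropagate this loss through the shared weights with an autograd framework.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def nested_distillation_loss(teacher_logits, student_logits_fn, budgets,
                             num_samples, rng):
    """Average KL between the teacher and submodels at sampled budgets."""
    teacher = softmax(teacher_logits)
    sampled = [rng.choice(budgets) for _ in range(num_samples)]
    losses = [kl_divergence(teacher, softmax(student_logits_fn(b)))
              for b in sampled]
    return sum(losses) / len(losses), sampled


teacher_logits = [2.0, 1.0, 0.0]


def toy_student(budget):
    # Hypothetical stand-in: smaller budgets give flatter, less confident logits.
    return [budget * v for v in teacher_logits]


loss, sampled = nested_distillation_loss(
    teacher_logits, toy_student, budgets=[0.25, 0.5, 1.0],
    num_samples=4, rng=random.Random(0))
print(loss, sampled)
```

Training on a single fixed budget corresponds to passing a one-element `budgets` list, which matches the single-budget models that are strong only near their own budget.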

FlexRank makes low-rank compression elastic: a single pretrained model becomes a family of deployable submodels, each selected by budget and backed by shared nested weights.

✨ One model. Many budgets.
⚡ Less compute. Smooth degradation.
🌍 Adaptive deployment across heterogeneous hardware, latency, and memory constraints.

How to cite us


      @inproceedings{zaccone2026flexrank,
        title={FlexRank: Nested Low-Rank Knowledge Decomposition for Adaptive Model Deployment},
        author={Zaccone, Riccardo and Laskaridis, Stefanos and Ciccone, Marco and Horvath, Samuel},
        booktitle={Forty-third International Conference on Machine Learning},
        year={2026},
        url={https://openreview.net/forum?id=DK0kvnNelx}
      }

ICML 2026 (Spotlight)
