ICRA 2026  ·  Vienna, Austria

LoRA-SP: Adaptive Capacity Allocation
for Vision Language Action Fine-tuning

Donghoon Kim Minji Bae* Unghui Nam* Gyeonghun Kim* Suyun Lee* Kyuhong Shim† Byonghyo Shim†
Seoul National University (ISLab)  ·  Sungkyunkwan University
* Equal contribution    † Corresponding authors

Abstract

Vision-language-action (VLA) models are increasingly used for Physical AI, but deploying a pre-trained VLA to unseen environments, embodiments, or tasks still requires adaptation. Parameter-efficient fine-tuning (PEFT), especially LoRA, is common for VLA policies, yet the exposed capacity knob — the rank — does not transfer uniformly: robotics transfer exhibits a higher and task-varying intrinsic rank than language fine-tuning. Small ranks suffice for LLMs (r ∈ {4,8}), while spectral analyses indicate VLAs may require much larger ranks (r ≈ 128) or near-full rank, a mismatch that worsens in multi-task settings.

We present LoRA-SP (Select–Prune), a rank-adaptive fine-tuning method that replaces fixed-rank updates with input- and layer-wise capacity allocation. LoRA-SP uses an SVD-style parameterization with a small router whose nonnegative scores act as singular values over a shared vector bank. The active set is chosen by an energy target on the cumulative squared scores, E_k(x) ≥ η, providing a direct link to approximation error. During training, a spectral concentration loss drives energy onto a few directions while the task loss preserves accuracy, yielding compact adapters that reduce cross-task interference and improve generalization.

On four real-robot manipulation tasks with an unseen AgileX PiPER arm, across π₀ and SmolVLA backbones, LoRA-SP matches or exceeds full fine-tuning and improves multi-task success by up to +31.6% over standard LoRA while remaining robust to rank choice.

Method

LoRA-SP Method Overview
Fig. 1 — LoRA-SP replaces fixed-rank LoRA with an input-conditioned SVD-style factorization using a shared vector bank and lightweight router.

LoRA-SP generalizes standard LoRA by replacing the fixed-rank update ΔW = BA with:

ΔW(x) = U · diag(s(x)) · V

where U, V define a shared vector bank (initialized wide at r = 128), and s(x) ≥ 0 are singular-value-like scores produced by a lightweight router per input.
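As a concrete sketch, the input-conditioned update can be written in a few lines of NumPy. The single-linear-map router with a softplus nonlinearity is our simplification (the paper only specifies a lightweight router producing nonnegative scores), and all dimensions are toy values:

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 16, 12, 8  # toy sizes; the paper initializes the bank wide, at r = 128

# Shared vector bank (trainable in practice; random here for illustration)
U = rng.normal(size=(d_out, r)) / np.sqrt(r)
V = rng.normal(size=(r, d_in)) / np.sqrt(d_in)
# Hypothetical router: one linear map + softplus to keep scores nonnegative
W_router = 0.1 * rng.normal(size=(r, d_in))

def router_scores(x):
    """Singular-value-like scores s(x) >= 0 for input x."""
    return np.log1p(np.exp(W_router @ x))  # softplus

def delta_w(x):
    """Input-conditioned update dW(x) = U . diag(s(x)) . V."""
    s = router_scores(x)
    return U @ np.diag(s) @ V

x = rng.normal(size=d_in)
dW = delta_w(x)
assert dW.shape == (d_out, d_in)
assert (router_scores(x) >= 0).all()
```

Because s(x) depends on the input, ΔW(x) varies per example while the bank U, V is shared across inputs, which is what allows the energy criterion to prune directions per input.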

Select: Energy-based Active Rank

Select mechanism
Fig. 2 — The effective rank k is chosen as the smallest index where cumulative energy meets threshold η.

Router scores are sorted and the effective rank k is the smallest index satisfying:

E_k(x) = Σ_{i=1}^{k} s_i(x)² / Σ_{j=1}^{r} s_j(x)² ≥ η

Since the relative Frobenius error of the truncated update is bounded by √(1 − E_k(x)), the threshold η is a direct accuracy–efficiency knob. Vectors beyond rank k are zeroed for that input.
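The selection rule is a short NumPy computation (function and variable names are ours, chosen for illustration). The final assertion checks the stated error bound numerically in the diagonal case; for the full update it holds exactly when U and V have orthonormal columns:

```python
import numpy as np

def select_rank(s, eta=0.9):
    """Return the smallest k whose cumulative squared-score energy reaches eta,
    plus the scores with vectors beyond rank k zeroed out."""
    order = np.argsort(s)[::-1]           # sort scores descending
    energy = np.cumsum(s[order] ** 2)
    energy /= energy[-1]                  # E_k in [0, 1]
    k = int(np.searchsorted(energy, eta) + 1)
    kept = np.zeros_like(s)
    kept[order[:k]] = s[order[:k]]        # zero the pruned directions
    return k, kept

s = np.array([3.0, 0.1, 2.0, 0.2])        # toy router scores
k, s_active = select_rank(s, eta=0.9)
# sorted squares [9, 4, 0.04, 0.01] give normalized cumulative energy
# [0.690, 0.996, ...], so k = 2 directions reach eta = 0.9
assert k == 2

# the energy bound: relative l2 error of the truncation is sqrt(1 - E_k)
E_k = (s_active ** 2).sum() / (s ** 2).sum()
err = np.linalg.norm(s - s_active) / np.linalg.norm(s)
assert np.isclose(err, np.sqrt(1 - E_k))
```

Raising η toward 1 monotonically grows k, so a single scalar trades approximation error against active capacity.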

Prune: Spectral Concentration Loss

Prune mechanism
Fig. 3 — The spectral loss drives singular-value mass onto a small stable set of directions over training.
Training adds a spectral concentration loss that penalizes the residual energy outside the selected rank:

ℒ_spec(x) = 1 − E_k(x)

This creates a positive feedback loop: selected vectors are reinforced, their singular values grow, and they dominate future selections. The task loss prevents accuracy collapse, yielding compact adapters without sacrificing performance.
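A minimal sketch of the per-input penalty, with `spectral_loss` as our illustrative name; minimizing it pushes squared-score mass into the top-k directions:

```python
import numpy as np

def spectral_loss(s, k):
    """L_spec(x) = 1 - E_k(x): energy left outside the top-k directions."""
    sq = np.sort(s ** 2)[::-1]
    return 1.0 - sq[:k].sum() / sq.sum()

# A concentrated spectrum is penalized less than a flat one
flat = np.ones(8)
peaky = np.array([4.0, 3.0, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1])
assert spectral_loss(peaky, k=2) < spectral_loss(flat, k=2)
```

Added alongside the task loss, this term shrinks the energy off the active set, producing the positive feedback loop described above: selected directions grow and keep being selected.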

Layer-wise Rank Distribution

Rank distribution per layer
Fig. 4 — LoRA-SP automatically assigns high rank to the vision module and low rank to language and action modules.

Experiments

Experiment Setup
Fig. 5 — Real-world setup: four manipulation tasks on AgileX PiPER (7-DoF), 480 demonstrations total.

Setup

  • Robot: AgileX PiPER (7-DoF), unseen during VLA pretraining
  • Tasks: Open the Pot · Pour the Block · Press the Button · Pick and Place
  • Data: 120 demos per task (480 total), dual RGB views
  • Backbones: π₀ (PaliGemma-3.5B) and SmolVLA (SmolVLM-2-based)
  • Baselines: Full FT, LoRA (r ∈ {16,32,64,128}), AdaLoRA, LoRA-MoE

Main Results

| Method              | π₀ Multi-Task Avg | SmolVLA Multi-Task Avg |
|---------------------|-------------------|------------------------|
| Full Fine-Tuning    | Best              | Best                   |
| LoRA (r=128)        | Baseline          | Baseline               |
| AdaLoRA             | ≈ LoRA            | ≈ LoRA                 |
| LoRA-MoE (weighted) | ≈ LoRA            | ≈ LoRA                 |
| LoRA-SP (ours)      | +23.3% over LoRA  | +31.6% over LoRA       |

LoRA-SP matches or exceeds full fine-tuning across tasks while updating significantly fewer parameters and remaining robust to rank choice.

Rank vs. Performance

Rank vs Performance
Fig. 6 — VLA transfer requires far higher intrinsic rank than language fine-tuning, motivating rank-adaptive methods.

Citation

@inproceedings{kim2026lorasp,
  author    = {Donghoon Kim and Minji Bae and Unghui Nam and
               Gyeonghun Kim and Suyun Lee and Kyuhong Shim and Byonghyo Shim},
  title     = {Adaptive Capacity Allocation for Vision Language Action Fine-tuning},
  booktitle = {IEEE International Conference on Robotics and Automation (ICRA)},
  year      = {2026},
  url       = {https://arxiv.org/abs/2603.07404},
}