ICRA 2026  ·  Vienna, Austria

LoRA-SP: Adaptive Capacity Allocation
for Vision Language Action Fine-tuning

Donghoon Kim Minji Bae* Unghui Nam* Gyeonghun Kim* Suyun Lee* Kyuhong Shim† Byonghyo Shim†
Seoul National University (ISLab)  ·  Sungkyunkwan University
* Equal contribution    † Corresponding authors

Abstract

Vision-language-action (VLA) models are increasingly used for Physical AI, but deploying a pre-trained VLA to unseen environments, embodiments, or tasks still requires adaptation. Parameter-efficient fine-tuning (PEFT), especially LoRA, is common for VLA policies, yet the exposed capacity knob — the rank — does not transfer uniformly: robotics transfer exhibits a higher and task-varying intrinsic rank than language fine-tuning. Small ranks suffice for LLMs (r ∈ {4,8}), while spectral analyses indicate VLAs may require much larger ranks (r ≈ 128) or near-full rank, a mismatch that worsens in multi-task settings.

We present LoRA-SP (Select–Prune), a rank-adaptive fine-tuning method that replaces fixed-rank updates with input- and layer-wise capacity allocation. LoRA-SP uses an SVD-style parameterization with a small router whose nonnegative scores act as singular values over a shared vector bank. The active set is chosen by an energy target on the cumulative squared scores, E_k(x) ≥ η, providing a direct link to approximation error. During training, a spectral concentration loss drives energy onto a few directions while the task loss preserves accuracy, yielding compact adapters that reduce cross-task interference and improve generalization.

On four real-robot manipulation tasks with an unseen AgileX PiPER arm, across π₀ and SmolVLA backbones, LoRA-SP matches or exceeds full fine-tuning and improves multi-task success by up to +31.6% over standard LoRA while remaining robust to rank choice.

Method

LoRA-SP Method Overview
Fig. 1 — LoRA-SP replaces fixed-rank LoRA with an input-conditioned SVD-style factorization using a shared vector bank and lightweight router.

LoRA-SP generalizes standard LoRA by replacing the fixed-rank update ΔW = BA with:

ΔW(x) = U · diag(s(x)) · V

where U, V define a shared vector bank (initialized wide at r = 128), and s(x) ≥ 0 are singular-value-like scores produced by a lightweight router per input.
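As a concrete sketch, the input-conditioned update can be written in a few lines of NumPy. The single-linear-map router with a softplus nonlinearity is our simplification (the paper only specifies a lightweight router producing nonnegative scores), and all dimensions are toy values:

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 16, 12, 8  # toy sizes; the paper initializes the bank wide, at r = 128

# Shared vector bank (trainable in practice; random here for illustration)
U = rng.normal(size=(d_out, r)) / np.sqrt(r)
V = rng.normal(size=(r, d_in)) / np.sqrt(d_in)
# Hypothetical router: one linear map + softplus to keep scores nonnegative
W_router = 0.1 * rng.normal(size=(r, d_in))

def router_scores(x):
    """Singular-value-like scores s(x) >= 0 for input x."""
    return np.log1p(np.exp(W_router @ x))  # softplus

def delta_w(x):
    """Input-conditioned update dW(x) = U . diag(s(x)) . V."""
    s = router_scores(x)
    return U @ np.diag(s) @ V

x = rng.normal(size=d_in)
dW = delta_w(x)
assert dW.shape == (d_out, d_in)
assert (router_scores(x) >= 0).all()
```

Because s(x) depends on the input, ΔW(x) varies per example while the bank U, V is shared across inputs, which is what allows the energy criterion to prune directions per input.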

Select: Energy-based Active Rank

Select mechanism
Fig. 2 — The effective rank k is chosen as the smallest index where cumulative energy meets threshold η.

Router scores are sorted and the effective rank k is the smallest index satisfying:

E_k(x) = Σ_{i=1}^{k} s_i(x)² / Σ_{j=1}^{r} s_j(x)² ≥ η

Since the relative Frobenius error of the truncated update is bounded by √(1 − E_k(x)), the threshold η is a direct accuracy–efficiency knob. Vectors beyond rank k are zeroed for that input.
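The selection rule is a short NumPy computation (function and variable names are ours, chosen for illustration). The final assertion checks the stated error bound numerically in the diagonal case; for the full update it holds exactly when U and V have orthonormal columns:

```python
import numpy as np

def select_rank(s, eta=0.9):
    """Return the smallest k whose cumulative squared-score energy reaches eta,
    plus the scores with vectors beyond rank k zeroed out."""
    order = np.argsort(s)[::-1]           # sort scores descending
    energy = np.cumsum(s[order] ** 2)
    energy /= energy[-1]                  # E_k in [0, 1]
    k = int(np.searchsorted(energy, eta) + 1)
    kept = np.zeros_like(s)
    kept[order[:k]] = s[order[:k]]        # zero the pruned directions
    return k, kept

s = np.array([3.0, 0.1, 2.0, 0.2])        # toy router scores
k, s_active = select_rank(s, eta=0.9)
# sorted squares [9, 4, 0.04, 0.01] give normalized cumulative energy
# [0.690, 0.996, ...], so k = 2 directions reach eta = 0.9
assert k == 2

# the energy bound: relative l2 error of the truncation is sqrt(1 - E_k)
E_k = (s_active ** 2).sum() / (s ** 2).sum()
err = np.linalg.norm(s - s_active) / np.linalg.norm(s)
assert np.isclose(err, np.sqrt(1 - E_k))
```

Raising η toward 1 monotonically grows k, so a single scalar trades approximation error against active capacity.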

Prune: Spectral Concentration Loss

Prune mechanism
Fig. 3 — The spectral loss drives singular-value mass onto a small stable set of directions over training.
Training adds a spectral concentration loss that penalizes the residual energy outside the selected rank:

ℒ_spec(x) = 1 − E_k(x)

This creates a positive feedback loop: selected vectors are reinforced, their singular values grow, and they dominate future selections. The task loss prevents accuracy collapse, yielding compact adapters without sacrificing performance.
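A minimal sketch of the per-input penalty, with `spectral_loss` as our illustrative name; minimizing it pushes squared-score mass into the top-k directions:

```python
import numpy as np

def spectral_loss(s, k):
    """L_spec(x) = 1 - E_k(x): energy left outside the top-k directions."""
    sq = np.sort(s ** 2)[::-1]
    return 1.0 - sq[:k].sum() / sq.sum()

# A concentrated spectrum is penalized less than a flat one
flat = np.ones(8)
peaky = np.array([4.0, 3.0, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1])
assert spectral_loss(peaky, k=2) < spectral_loss(flat, k=2)
```

Added alongside the task loss, this term shrinks the energy off the active set, producing the positive feedback loop described above: selected directions grow and keep being selected.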

Layer-wise Rank Distribution

Rank distribution per layer
Fig. 4 — LoRA-SP automatically assigns high rank to the vision module and low rank to language and action modules.

Experiments

Experiment Setup
Fig. 5 — Real-world setup: four manipulation tasks on AgileX PiPER (7-DoF), 480 demonstrations total.

Setup

  • Robot: AgileX PiPER (7-DoF), unseen during VLA pretraining
  • Tasks: Open the Pot · Pour the Block · Press the Button · Pick and Place
  • Data: 120 demos per task (480 total), dual RGB views
  • Backbones: π₀ (PaliGemma-3.5B) and SmolVLA (SmolVLM-2-based)
  • Baselines: Full FT, LoRA (r ∈ {16,32,64,128}), AdaLoRA, LoRA-MoE

Main Results

| Method              | π₀ Multi-Task Avg | SmolVLA Multi-Task Avg |
|---------------------|-------------------|------------------------|
| Full Fine-Tuning    | Best              | Best                   |
| LoRA (r=128)        | Baseline          | Baseline               |
| AdaLoRA             | ≈ LoRA            | ≈ LoRA                 |
| LoRA-MoE (weighted) | ≈ LoRA            | ≈ LoRA                 |
| LoRA-SP (ours)      | +23.3% over LoRA  | +31.6% over LoRA       |

LoRA-SP matches or exceeds full fine-tuning across tasks while updating significantly fewer parameters and remaining robust to rank choice.

Rank vs. Performance

Rank vs Performance
Fig. 6 — VLA transfer requires far higher intrinsic rank than language fine-tuning, motivating rank-adaptive methods.

Citation

@inproceedings{kim2026lorasp,
  author    = {Donghoon Kim and Minji Bae and Unghui Nam and
               Gyeonghun Kim and Suyun Lee and Kyuhong Shim and Byonghyo Shim},
  title     = {Adaptive Capacity Allocation for Vision Language Action Fine-tuning},
  booktitle = {IEEE International Conference on Robotics and Automation (ICRA)},
  year      = {2026},
  url       = {https://arxiv.org/abs/2603.07404},
}