{"ID":2841251,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.12101","arxiv_id":"2511.12101","title":"Decoupled Action Expert: Confining Task Knowledge to the Conditioning Pathway","abstract":"Many recent Vision-Language-Action models employ diffusion or flow-matching backbones with hundreds of millions of parameters for action generation. However, unlike image synthesis where the output spans millions of diverse pixels, a manipulation policy generates only short sequences of low-dimensional, physically correlated action values, a far simpler target that should not demand such capacity. We confirm this intuition and show that task-specific knowledge in these policies can be fully confined to the conditioning pathway, leaving the action backbone task-agnostic. To establish this, we introduce a decoupled training recipe: a general-purpose action head is first pretrained on observation-free forward-kinematics data, then frozen while only the conditioning pathway is trained for downstream tasks. Using Diffusion Policy as a testbed, we show that on both MimicGen and LIBERO, a single frozen backbone shared across all tasks matches normally trained counterparts. This confirms that the action expert encodes little task-specific knowledge. Ablations show that the specific pretraining signal (joint positions, end-effector poses, or no conditioning at all) has no effect on downstream performance, indicating that the backbone learns only general trajectory structure. Pushing this finding further, we replace the 244M U-Net in Diffusion Policy with a 5M-parameter MLP backbone that matches or exceeds its performance, calling into question the large capacity budgets allocated to action generation in current VLA designs.","short_abstract":"Many recent Vision-Language-Action models employ diffusion or flow-matching backbones with hundreds of millions of parameters for action generation. However, unlike image synthesis where the output spans millions of diverse pixels, a manipulation policy generates only short sequences of low-dimensional, physically corr...","url_abs":"https://arxiv.org/abs/2511.12101","url_pdf":"https://arxiv.org/pdf/2511.12101v2","authors":"[\"Jian Zhou\",\"Sihao Lin\",\"Shuai Fu\",\"Zerui Li\",\"Gengze Zhou\",\"Qi WU\"]","published":"2025-11-15T08:39:50Z","proceeding":"cs.RO","tasks":"[\"cs.RO\",\"cs.AI\",\"cs.LG\"]","methods":"[\"Diffusion Model\"]","has_code":false}
