{"ID":3084515,"CreatedAt":"2026-06-05T06:46:15.197025399Z","UpdatedAt":"2026-06-06T15:28:14.12845936Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.05254","arxiv_id":"2606.05254","title":"Flash-WAM: Modality-Aware Distillation for World Action Models","abstract":"World-action models (WAMs) jointly generate future video and robot actions through iterative diffusion, achieving strong performance on manipulation benchmarks but requiring tens of denoising steps, a cost that precludes real-time control. Step distillation has emerged as the natural remedy, but off-the-shelf methods break down in the joint video-action setting because video and action streams use different SNR-shifted noise schedules and reach training with substantially different marginal noise distributions, an asymmetry that single-modality distillation methods cannot accommodate. We introduce \\textbf{Flash-WAM}, a modality-aware step-distillation framework inspired by consistency distillation that selects the consistency function for each modality to match its noise regime: a linear-gradient-scaling parametrization for the action stream's low-noise regime, paired with a variance-preserving parametrization for the video stream's high-noise regime, grounded in a structural analysis of the consistency-function family that characterizes the achievable gradient scaling under the consistency boundary condition. Instantiated on LingBot-VA, Flash-WAM compresses inference to a single step in each modality. On RoboTwin 2.0, this reduces per-chunk latency from $8.1$ seconds to $348$ ms on NVIDIA L40S, a $23{\\times}$ speedup that enables real-time inference. Flash-WAM preserves task success on simulation benchmarks ($85.5\\%$ RoboTwin 2.0, $95.7\\%$ LIBERO) and substantially recovers real-world performance ($60\\%$ average on a Unitree G1 humanoid robot), while naive consistency distillation drops to $24\\%$ at the same step budget.","short_abstract":"World-action models (WAMs) jointly generate future video and robot actions through iterative diffusion, achieving strong performance on manipulation benchmarks but requiring tens of denoising steps, a cost that precludes real-time control. Step distillation has emerged as the natural remedy, but off-the-shelf methods b...","url_abs":"https://arxiv.org/abs/2606.05254","url_pdf":"https://arxiv.org/pdf/2606.05254v1","authors":"[\"Arman Akbari\",\"Ci Zhang\",\"Arash Akbari\",\"Lin Zhao\",\"Yixiao Chen\",\"Weiwei Chen\",\"Xuan Zhang\",\"Geng Yuan\",\"Yanzhi Wang\"]","published":"2026-06-03T15:29:57Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.CV\",\"cs.RO\"]","methods":"[\"Diffusion Model\"]","has_code":false}
