{"ID":3084794,"CreatedAt":"2026-06-05T06:46:15.197025399Z","UpdatedAt":"2026-06-07T02:49:24.740369373Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.05624","arxiv_id":"2606.05624","title":"KV-Control: Parameter-Efficient K/V Injection for Trajectory-Controlled Text-to-Motion","abstract":"Text-conditioned 3D human motion models now synthesize plausible motions from prompts, but practical animation and embodied-agent workflows rarely stop at text: a character may need to follow a sketched root path, hit an end-effector target, or satisfy a multi-joint trajectory while still preserving the gait, style, and intent described by language. This exposes a control trade-off. A trajectory controller should be precise without overwriting the pretrained text-conditioned motion prior, yet existing solutions either duplicate large portions of the generator to regain per-layer control access or move much of the cost to test-time optimization. We introduce KV-Control, a compact attention-side control interface for frozen masked text-to-motion transformers. The key idea is to make geometric constraints available as memory inside self-attention rather than injecting them through a global pose token or enforcing them only at the output side. To support this interface, we co-design a part-tokenized motion substrate and controller: \\textbf{PartVQ} learns anatomy-aligned part codebooks, T-Concat exposes each frame--part token as an attention-addressable site, and KV-Control injects control-conditioned key/value memories at every self-attention layer while preserving the pretrained query stream, text cross-attention, FFN, and all backbone weights. The resulting adapter adds only trainable injection parameters atop a shared trajectory encoder, yet tracks root and multi-joint constraints with sub-centimeter accuracy under the inherited refinement protocol while retaining text-conditioned motion quality. KV-Control reframes trajectory conditioning as lightweight memory retrieval, providing a small, precise, and transparent control interface for text-to-motion generation.","short_abstract":"Text-conditioned 3D human motion models now synthesize plausible motions from prompts, but practical animation and embodied-agent workflows rarely stop at text: a character may need to follow a sketched root path, hit an end-effector target, or satisfy a multi-joint trajectory while still preserving the gait, style, an...","url_abs":"https://arxiv.org/abs/2606.05624","url_pdf":"https://arxiv.org/pdf/2606.05624v1","authors":"[\"Tengjiao Sun\",\"Pengcheng Fang\",\"Xiaoyu Zhan\",\"Yanwen Guo\",\"Dongjie Fu\",\"Xiaohao Cai\",\"Hansung Kim\"]","published":"2026-06-04T02:50:20Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.GR\"]","methods":"[\"Transformer\"]","has_code":false}
