{"ID":2921797,"CreatedAt":"2026-06-02T02:42:49.606572591Z","UpdatedAt":"2026-06-03T05:56:00.181519634Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.01332","arxiv_id":"2606.01332","title":"S2M-Trek: From Single to Multi-Sphere Transport via Per-Frame Deep Sets on a Wheel-Legged Robot","abstract":"We study the problem of scaling dynamic loco-manipulation from a single free-rolling sphere to multiple spheres transported simultaneously on the back of a wheel-legged quadruped, without fences, grippers, or mechanical stops. Multiple identical free-rolling spheres form an unordered set with no persistent identity: their ordering may change independently at each history frame, creating a \\emph{per-frame permutation symmetry} that standard history-concatenation set encoders do not explicitly enforce -- these encoders impose only a shared, diagonal permutation symmetry over the full history. We show that this symmetry mismatch leads to a concrete failure mode in curriculum-based reinforcement learning. Within the same PPO training budget, flat MLPs and branch-wise encoders plateau at or below the two-sphere stage, while a history-concatenation Deep Sets baseline (\\HCDS) fails to progress past the two-sphere stage in our runs unless ball-to-slot assignments are randomised during training, suggesting that it exploits slot indices as a curriculum shortcut rather than learning identity-free multi-sphere dynamics. We propose \\textbf{Per-Frame Deep Sets (\\PFDS)}, which performs permutation-invariant pooling within each history frame before temporal readout; we prove that \\PFDS is $\\Gframe$-invariant and universally approximates continuous $\\Gframe$-invariant policies. A $2{\\times}2$ ablation over encoder architecture and slot randomisation separates the architectural and data-augmentation pathways, and \\PFDS reaches the five-sphere stage with 100\\% no-drop transport in simulation across all five random seeds. \nWe further distill the \\PFDS teacher into \\TactSet via DAgger, replacing privileged sphere-state observations with a $16{\\times}16$ Boolean union contact map, yielding a compact and naturally $\\Gframe$-invariant tactile representation.","short_abstract":"We study the problem of scaling dynamic loco-manipulation from a single free-rolling sphere to multiple spheres transported simultaneously on the back of a wheel-legged quadruped, without fences, grippers, or mechanical stops. Multiple identical free-rolling spheres form an unordered set with no persistent identity: th...","url_abs":"https://arxiv.org/abs/2606.01332","url_pdf":"https://arxiv.org/pdf/2606.01332v1","authors":"[\"Zong Chen\",\"Xuebin Li\",\"Jinpeng Xiao\",\"Shaoyang Li\",\"Ben Liu\",\"Min Li\",\"Zhouping Yin\",\"Yiqun Li\"]","published":"2026-05-31T16:35:38Z","proceeding":"cs.RO","tasks":"[\"cs.RO\"]","methods":"[\"Reinforcement Learning\"]","has_code":false}