{"ID":2877513,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.19638","arxiv_id":"2508.19638","title":"Beyond BEV: Optimizing Point-Level Tokens for Collaborative Perception","abstract":"Collaborative perception allows agents to enhance their perceptual capabilities by exchanging intermediate features. Existing methods typically organize these intermediate features as 2D bird's-eye-view (BEV) representations, which discard critical fine-grained 3D structural cues essential for accurate object recognition and localization. To this end, we first introduce point-level tokens as intermediate representations for collaborative perception. However, point-cloud data are inherently unordered, massive, and position-sensitive, making it challenging to produce compact and aligned point-level token sequences that preserve detailed structural information. Therefore, we present CoPLOT, a novel Collaborative perception framework that utilizes Point-Level Optimized Tokens. It incorporates a point-native processing pipeline, including token reordering, sequence modeling, and multi-agent spatial alignment. A semantic-aware token reordering module generates adaptive 1D reorderings by leveraging scene-level and token-level semantic information. A frequency-enhanced state space model captures long-range sequence dependencies across both spatial and spectral domains, improving the differentiation between foreground tokens and background clutter. Lastly, a neighbor-to-ego alignment module applies a closed-loop process, combining global agent-level correction with local token-level refinement to mitigate localization noise. Extensive experiments on both simulated and real-world datasets show that CoPLOT outperforms state-of-the-art models, with even lower communication and computation overhead. Code will be available at https://github.com/CheeryLeeyy/CoPLOT.","short_abstract":"Collaborative perception allows agents to enhance their perceptual capabilities by exchanging intermediate features. Existing methods typically organize these intermediate features as 2D bird's-eye-view (BEV) representations, which discard critical fine-grained 3D structural cues essential for accurate object recogniti...","url_abs":"https://arxiv.org/abs/2508.19638","url_pdf":"https://arxiv.org/pdf/2508.19638v1","authors":"[\"Yang Li\",\"Quan Yuan\",\"Guiyang Luo\",\"Xiaoyuan Fu\",\"Rui Pan\",\"Yujia Yang\",\"Congzhang Shao\",\"Yuewen Liu\",\"Jinglin Li\"]","published":"2025-08-27T07:27:42Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.AI\"]","methods":"[\"Generative Adversarial Network\"]","has_code":false,"code_links":[{"ID":610389,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2877513,"paper_url":"https://arxiv.org/abs/2508.19638","paper_title":"Beyond BEV: Optimizing Point-Level Tokens for Collaborative Perception","repo_url":"https://github.com/CheeryLeeyy/CoPLOT","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
