{"ID":3004870,"CreatedAt":"2026-06-03T03:09:48.883664427Z","UpdatedAt":"2026-06-05T11:43:53.432517148Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.03540","arxiv_id":"2606.03540","title":"Attend to Anything: Foundation Model for Unified Human Attention Modeling","abstract":"Existing human attention (saliency) modeling methods persist as highly fragmented across modalities, scenes, and task formulations. Consequently, even with increasing model capacity and data scale, current models predominantly remain scene-dependent and task-specific, failing to practically generalize in real-world applications. To address the fundamental limitations, we present the Attend to Anything Model (AAM), a multi-modal foundation model that unifies attention modeling across various image, video, and audio-visual tasks and scenes. AAM reformulates attention as a cognitive entailment relationship organized in a general-to-specific hierarchy, implemented through language prompts with hierarchical embeddings in hyperbolic space. Furthermore, to unify static image and dynamic video attention, we adopt a fluid-dynamics perspective, formulating video-frame attention as a diffusive temporal evolution governed by the Fokker--Planck equation. Extensive experiments on 16 benchmarks demonstrate that AAM consistently outperforms state-of-the-art methods by an average of 6\\% across various scenarios, while achieving approximately a 4$\\times$ speedup in video inference. Overall, these results demonstrate that AAM provides a principled foundation for future research on attention and saliency-related tasks. The dataset and code will be available at https://github.com/wz-zhao/Attend-to-Anything.","short_abstract":"Existing human attention (saliency) modeling methods persist as highly fragmented across modalities, scenes, and task formulations. Consequently, even with increasing model capacity and data scale, current models predominantly remain scene-dependent and task-specific, failing to practically generalize in real-world app...","url_abs":"https://arxiv.org/abs/2606.03540","url_pdf":"https://arxiv.org/pdf/2606.03540v1","authors":"[\"Wenzhuo Zhao\",\"Ronghao Xian\",\"Keren Fu\",\"Qijun Zhao\"]","published":"2026-06-02T12:00:21Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Generative Adversarial Network\"]","has_code":false,"code_links":[{"ID":612714,"CreatedAt":"2026-06-03T03:09:48.883664427Z","UpdatedAt":"2026-06-03T03:09:48.883664427Z","DeletedAt":null,"paper_id":3004870,"paper_url":"https://arxiv.org/abs/2606.03540","paper_title":"Attend to Anything: Foundation Model for Unified Human Attention Modeling","repo_url":"https://github.com/wz-zhao/Attend-to-Anything","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
