{"ID":2861962,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.00652","arxiv_id":"2510.00652","title":"OTTER: Open-Tagging via Text-Image Representation for Multi-modal Understanding","abstract":"We introduce OTTER, a unified open-set multi-label tagging framework that harmonizes the stability of a curated, predefined category set with the adaptability of user-driven open tags. OTTER is built upon a large-scale, hierarchically organized multi-modal dataset, collected from diverse online repositories and annotated through a hybrid pipeline combining automated vision-language labeling with human refinement. By leveraging a multi-head attention architecture, OTTER jointly aligns visual and textual representations with both fixed and open-set label embeddings, enabling dynamic and semantically consistent tagging. OTTER consistently outperforms competitive baselines on two benchmark datasets: it achieves an overall F1 score of 0.81 on Otter and 0.75 on Favorite, surpassing the next-best results by margins of 0.10 and 0.02, respectively. OTTER attains near-perfect performance on open-set labels, with F1 of 0.99 on Otter and 0.97 on Favorite, while maintaining competitive accuracy on predefined labels. These results demonstrate OTTER's effectiveness in bridging closed-set consistency with open-vocabulary flexibility for multi-modal tagging applications.","short_abstract":"We introduce OTTER, a unified open-set multi-label tagging framework that harmonizes the stability of a curated, predefined category set with the adaptability of user-driven open tags. OTTER is built upon a large-scale, hierarchically organized multi-modal dataset, collected from diverse online repositories and annotat...","url_abs":"https://arxiv.org/abs/2510.00652","url_pdf":"https://arxiv.org/pdf/2510.00652v1","authors":"[\"Jieer Ouyang\",\"Xiaoneng Xiang\",\"Zheng Wang\",\"Yangkai Ding\"]","published":"2025-10-01T08:31:19Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Generative Adversarial Network\"]","has_code":false}
