{"ID":3004831,"CreatedAt":"2026-06-03T03:09:48.883664427Z","UpdatedAt":"2026-06-05T11:43:53.432517148Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.03614","arxiv_id":"2606.03614","title":"OmniHalluc-L: Counterfactual Benchmarking and Modality-Perturbation Reliability Calibration for Long-Form Omni Hallucination","abstract":"Long-video Omni assistants often fail not by inventing content, but by misbinding real evidence: they hear the right utterance and see the right event, yet attach it to the wrong speaker, moment, or modality. These almost-true errors evade standard video QA because local evidence remains valid, so item-level scoring can reward both a supported claim and its near-counterfactual. We introduce a counterfactual event-binding protocol that constructs paired supported/counterfactual claims from the same audio-visual event evidence and evaluates them by strict-pair accuracy. We instantiate it as \\bench, a benchmark for long-video Omni hallucination, with 3,600 single-claim QA items from 638 long-form videos averaging 24.16 minutes and covering 256.87 hours. Under this protocol, open-weight Omni models remain weak at pair-level binding: Qwen2.5-Omni-7B reaches 32.06% and Qwen3-Omni-Instruct reaches 41.55%, versus 76.54% for a closed-source reference. To narrow this gap without updating the backbone, we propose \\method, Modality-Perturbation Reliability Calibration, a frozen-backbone framework that selects audio-negative probes within video-level folds and fuses their response shifts with native audio-visual confidence into per-claim support estimates. \\method lifts Qwen2.5-Omni-7B to 36.22% and Qwen3 to 51.09% on \\bench, and improves target-adapted MCQ accuracy on OmniVideoBench (+2.20) and WorldSense (+1.51) with Qwen3.","short_abstract":"Long-video Omni assistants often fail not by inventing content, but by misbinding real evidence: they hear the right utterance and see the right event, yet attach it to the wrong speaker, moment, or modality. These almost-true errors evade standard video QA because local evidence remains valid, so item-level sco...","url_abs":"https://arxiv.org/abs/2606.03614","url_pdf":"https://arxiv.org/pdf/2606.03614v1","authors":"[\"Zixuan Dong\",\"Jiafu Tang\",\"Zhide Lei\",\"Zhe Cao\",\"Zijie Zhang\",\"Yanghai Wang\",\"Shihao Li\",\"Xiaodong Wang\",\"Baoyun Peng\",\"Jiaheng Liu\"]","published":"2026-06-02T13:14:02Z","proceeding":"cs.MM","tasks":"[\"cs.MM\"]","methods":"[]","has_code":false}
