{"ID":2921886,"CreatedAt":"2026-06-02T02:42:49.606572591Z","UpdatedAt":"2026-06-03T22:46:55.310989306Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.01485","arxiv_id":"2606.01485","title":"Perception First: A Frontier Native-Video Model with Self-Consistency for Implicit Video Question Answering","abstract":"We describe our submission to the VRR Challenge @ CVPR 2026, built on the \\emph{ImplicitQA} / \\emph{VRR-QA} benchmark~\\cite{implicitqa}: multiple-choice video question answering in which answers are deliberately \\emph{not} observable in any single frame and must be inferred from spatial layout, motion, depth, viewpoint, causality, and social context across discontinuous frames of creative video. We conduct a systematic, training-free study spanning open-source Video-LMMs (Qwen2.5-VL~\\cite{qwen25vl}, Qwen3-VL~\\cite{qwen3vl}, InternVL3, Gemma-3, and the RL-tuned video reasoners Video-R1~\\cite{videor1} and VideoChat-R1.5~\\cite{videochatr15}) and a battery of inference-time strategies (chain-of-thought, question decomposition, describe-then-reason cascades, audio transcripts, spatial state prompting, self-consistency~\\cite{selfconsistency}, multi-model ensembling, and category routing). Our central finding is that this benchmark is \\emph{perception-bound rather than reasoning-bound}: reasoning-side augmentations are neutral-to-harmful, whereas base-model perceptual capability and lightweight test-time denoising are the only reliable levers. \nA per-category error analysis localizes the difficulty to low-level perception -- relative depth, viewpoint, and counting are the hardest categories, while causal and social reasoning are nearly solved -- and a prompt that explicitly injects monocular depth cues to attack the weakest category \\emph{lowers} test accuracy by $5.8$ points, confirming that the model needs a better \\emph{percept}, not a better \\emph{procedure}.","short_abstract":"We describe our submission to the VRR Challenge @ CVPR 2026, built on the \\emph{ImplicitQA} / \\emph{VRR-QA} benchmark~\\cite{implicitqa}: multiple-choice video question answering in which answers are deliberately \\emph{not} observable in any single frame and must be inferred from spatial layout, motion, depth, viewpoint...","url_abs":"https://arxiv.org/abs/2606.01485","url_pdf":"https://arxiv.org/pdf/2606.01485v1","authors":"[\"Ali Alavi\"]","published":"2026-05-31T23:00:17Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.LG\"]","methods":"[]","has_code":false}
