{"ID":2838671,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.17358","arxiv_id":"2511.17358","title":"Don't Learn, Ground: A Case for Natural Language Inference with Visual Grounding","abstract":"We propose a zero-shot method for Natural Language Inference (NLI) that leverages multimodal representations by grounding language in visual contexts. Our approach generates visual representations of premises using text-to-image models and performs inference by comparing these representations with textual hypotheses. We evaluate two inference techniques: cosine similarity and visual question answering. Our method achieves high accuracy without task-specific fine-tuning, demonstrating robustness against textual biases and surface heuristics. Additionally, we design a controlled adversarial dataset to validate the robustness of our approach. Our findings suggest that leveraging visual modality as a meaning representation provides a promising direction for robust natural language understanding.","short_abstract":"We propose a zero-shot method for Natural Language Inference (NLI) that leverages multimodal representations by grounding language in visual contexts. Our approach generates visual representations of premises using text-to-image models and performs inference by comparing these representations with textual hypotheses. W...","url_abs":"https://arxiv.org/abs/2511.17358","url_pdf":"https://arxiv.org/pdf/2511.17358v1","authors":"[\"Daniil Ignatev\",\"Ayman Santeer\",\"Albert Gatt\",\"Denis Paperno\"]","published":"2025-11-21T16:23:17Z","proceeding":"cs.CL","tasks":"[\"cs.CL\"]","methods":"[]","has_code":false}
