{"ID":2896433,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.06571","arxiv_id":"2507.06571","title":"Enhancing Food-Domain Question Answering with a Multimodal Knowledge Graph: Hybrid QA Generation and Diversity Analysis","abstract":"We propose a unified food-domain QA framework that combines a large-scale multimodal knowledge graph (MMKG) with generative AI. Our MMKG links 13,000 recipes, 3,000 ingredients, 140,000 relations, and 14,000 images. We generate 40,000 QA pairs using 40 templates and LLaVA/DeepSeek augmentation. Joint fine-tuning of Meta LLaMA 3.1-8B and Stable Diffusion 3.5-Large improves BERTScore by 16.2%, reduces FID by 37.8%, and boosts CLIP alignment by 31.1%. Diagnostic analyses, namely CLIP-based mismatch detection (35.2% to 7.3%) and LLaVA-driven hallucination checks, ensure factual and visual fidelity. A hybrid retrieval-generation strategy achieves 94.1% accurate image reuse and 85% adequacy in synthesis. Our results demonstrate that structured knowledge and multimodal generation together enhance reliability and diversity in food QA.","short_abstract":"We propose a unified food-domain QA framework that combines a large-scale multimodal knowledge graph (MMKG) with generative AI. Our MMKG links 13,000 recipes, 3,000 ingredients, 140,000 relations, and 14,000 images. We generate 40,000 QA pairs using 40 templates and LLaVA/DeepSeek augmentation. Joint fine-tuning of Met...","url_abs":"https://arxiv.org/abs/2507.06571","url_pdf":"https://arxiv.org/pdf/2507.06571v1","authors":"[\"Srihari K B\",\"Pushpak Bhattacharyya\"]","published":"2025-07-09T05:59:06Z","proceeding":"cs.CL","tasks":"[\"cs.CL\"]","methods":"[\"Diffusion Model\"]","has_code":false}
