{"ID":2892720,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.14426","arxiv_id":"2507.14426","title":"CRAFT: A Neuro-Symbolic Framework for Visual Functional Affordance Grounding","abstract":"We introduce CRAFT, a neuro-symbolic framework for interpretable affordance grounding, which identifies the objects in a scene that enable a given action (e.g., \"cut\"). CRAFT integrates structured commonsense priors from ConceptNet and language models with visual evidence from CLIP, using an energy-based reasoning loop to refine predictions iteratively. This process yields transparent, goal-driven decisions to ground symbolic and perceptual structures. Experiments in multi-object, label-free settings demonstrate that CRAFT enhances accuracy while improving interpretability, providing a step toward robust and trustworthy scene understanding.","short_abstract":"We introduce CRAFT, a neuro-symbolic framework for interpretable affordance grounding, which identifies the objects in a scene that enable a given action (e.g., \"cut\"). CRAFT integrates structured commonsense priors from ConceptNet and language models with visual evidence from CLIP, using an energy-based reasoning loop...","url_abs":"https://arxiv.org/abs/2507.14426","url_pdf":"https://arxiv.org/pdf/2507.14426v1","authors":"[\"Zhou Chen\",\"Joe Lin\",\"Sathyanarayanan N. Aakur\"]","published":"2025-07-19T01:06:29Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Language Model\"]","has_code":false}