{"ID":2847228,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.00488","arxiv_id":"2511.00488","title":"\\texttt{ReMind}: Understanding Deductive Code Reasoning in LLMs","abstract":"Large Language Models (LLMs) have achieved remarkable progress in code-related tasks. Despite their advancement, empirical evidence reveals that they still struggle with \\emph{deductive code reasoning}, the ability to reason about the program execution process. While prior studies have recognized this limitation, the underlying causes remain largely underexplored. In this paper, we begin by presenting a comprehensive empirical study that reveals three key challenges undermining deductive code reasoning: (1) an intrinsic gap between generation and reasoning abilities, (2) a consistent bias towards code sources, and (3) weak zero-shot generalization on complex benchmarks. In light of these challenges, we propose \\texttt{ReMind}, a multi-agent framework composed of \\texttt{Mutator}, \\texttt{Executor}, and \\texttt{Inspector}. The \\texttt{Mutator} generates code variants to mitigate bias towards code sources, the \\texttt{Executor} traces variable states step-by-step to expose inconsistency, and the \\texttt{Inspector} identifies problematic reasoning steps and provides control-flow refinement to bridge the intrinsic reasoning gap. Through their coordinated collaboration, \\texttt{ReMind} systematically identifies and refines reasoning flaws, achieving outstanding performance and enabling robust zero-shot generalization. Extensive experiments on two benchmarks with five LLMs demonstrate the superior advantages of \\texttt{ReMind} compared to baseline approaches in deductive code reasoning.","short_abstract":"Large Language Models (LLMs) have achieved remarkable progress in code-related tasks. Despite their advancement, empirical evidence reveals that they still struggle with \\emph{deductive code reasoning}, the ability to reason about the program execution process. While prior studies have recognized this limitation, the u...","url_abs":"https://arxiv.org/abs/2511.00488","url_pdf":"https://arxiv.org/pdf/2511.00488v1","authors":"[\"Jun Gao\",\"Yun Peng\",\"Xiaoxue Ren\"]","published":"2025-11-01T10:42:40Z","proceeding":"cs.PL","tasks":"[\"cs.PL\",\"cs.CL\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
