Code Comprehension with GitHub Copilot: Performance Gains, Comprehension Trade-offs, and Behavioral Predictors in Brownfield Programming
Abstract
Teaching Computer Science (CS) students how to comprehend and maintain legacy code bases is a critical challenge in software engineering education. While Generative AI (GenAI) assistants like GitHub Copilot improve task completion speed and correctness, their impact on code understanding remains unclear. We conducted a within-subject study with 15 graduate CS students completing feature implementation tasks with and without Copilot. Despite significant performance improvements, participants showed no overall comprehension improvement ($p=0.59$), revealing a \textit{comprehension-performance decoupling}. Further analysis uncovered a \textit{comprehension trade-off}: performance gains correlated negatively with reverse engineering comprehension ($\rho=-0.57$, $p=0.026$) but showed a positive trend with implementation comprehension ($\rho=0.50$, $p=0.06$). A follow-up behavioral analysis revealed that \textit{how} students used Copilot determined outcomes: engaging in verification loops, in which programmers actively reviewed generated code, strongly predicted comprehension ($p<0.001$, $r=0.96$), with high-comprehension participants verifying code 4.7 times more frequently than low-comprehension participants. These findings suggest that GenAI tools do not inherently undermine comprehension; rather, passive consumption patterns do. They point to a need for programming education to teach system-level verification skills and for educational GenAI tools to be redesigned to scaffold active cognitive engagement.