PseudoBridge: Pseudo Code as the Bridge for Better Semantic and Logic Alignment in Code Retrieval
Abstract
Code retrieval aims to find relevant code snippets matching natural language queries within massive codebases, playing a vital role in software development. Recent advances leverage pre-trained language models (PLMs) to bridge the semantic gap between natural language (NL) and programming languages (PL), significantly outperforming traditional information retrieval and early deep learning approaches. However, existing methods still face two key challenges: a fundamental semantic gap between human intent and machine execution logic, and limited robustness to diverse code styles. To address these challenges, we propose PseudoBridge, a novel code retrieval framework that introduces pseudo-code as an intermediate, semi-structured modality to align NL semantics with PL logic. Specifically, PseudoBridge consists of two stages. First, we employ a large language model (LLM) to synthesize pseudo-code, enabling explicit alignment between NL queries and pseudo-code. Second, we introduce a logic-invariant code style augmentation strategy: we employ the LLM to generate stylistically diverse yet logically equivalent code implementations, and then align these varied code styles with pseudo-code to enhance robustness. We evaluate PseudoBridge across 10 PLMs and 6 mainstream programming languages. Extensive experiments demonstrate that PseudoBridge consistently outperforms baselines, achieving significant improvements in generalization, particularly in zero-shot scenarios such as Solidity and XLCoST. Extended evaluations using open-source LLMs and advanced embeddings confirm that these gains stem from PseudoBridge's intrinsic design and are independent of specific closed-source models. PseudoBridge achieves performance comparable to state-of-the-art (SOTA) embedding methods, highlighting the effectiveness of explicit logical and semantic alignment via pseudo-code as a robust solution for code retrieval.
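To make the two alignment stages concrete, the following is a minimal sketch and not the authors' implementation. It assumes that "alignment" is realized with an in-batch contrastive (InfoNCE) objective over precomputed embeddings, which the abstract does not specify. The names `info_nce`, `two_stage_losses`, and the `temperature` value are hypothetical and introduced only for illustration.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(anchors: List[Vector], positives: List[Vector],
             temperature: float = 0.05) -> float:
    """Mean InfoNCE loss (assumed objective, not confirmed by the abstract).

    positives[i] is the match for anchors[i]; every other row in the
    batch serves as an in-batch negative.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over the batch.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(
            sum(math.exp(l - max_logit) for l in logits))
        total += log_sum_exp - logits[i]
    return total / len(anchors)


def two_stage_losses(nl: List[Vector], pseudo: List[Vector],
                     code_variants: List[List[Vector]],
                     temperature: float = 0.05) -> Tuple[float, float]:
    """Return (stage-1 loss, stage-2 loss) for one batch.

    Stage 1 aligns NL queries with pseudo-code. Stage 2 aligns
    pseudo-code with each stylistic variant of the code, averaged over
    variants. code_variants[k][i] is the k-th style of sample i.
    """
    stage1 = info_nce(nl, pseudo, temperature)
    stage2 = sum(info_nce(pseudo, variant, temperature)
                 for variant in code_variants) / len(code_variants)
    return stage1, stage2
```

In this sketch, the embeddings would come from whichever PLM is being fine-tuned, and the pseudo-code and code style variants would be generated by the LLM beforehand. Averaging the stage-2 loss over variants is one simple design choice among several; it treats each style as an equally weighted view of the same logic.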