Context-Picker: Dynamic context selection using multi-stage reinforcement learning
Abstract
In long-context question answering, selecting the appropriate scope of context for a query remains a key and unresolved challenge. Insufficient context can lead to missing essential information, whereas excessive context often introduces noise and degrades answer quality. Conventional methods, such as retrieving a fixed number of passages or applying reranking, struggle to dynamically determine which context to include. This is especially problematic for factoid questions, which typically depend only on a few precise pieces of evidence. To overcome this limitation, we propose Context-Picker, a reasoning-aware framework that reframes context selection as the task of identifying a minimal sufficient evidence subset, moving beyond conventional similarity-based ranking. Context-Picker uses a human-inspired two-stage reinforcement learning schedule: stage 1 focuses on improving the recall rate of critical passages, and stage 2 prioritizes pruning redundancy to distill a compact evidence set. To resolve reward sparsity, we propose an offline evidence distillation pipeline that mines ``minimal sufficient sets" via a Leave-One-Out (LOO) procedure, providing dense and task-aligned supervision. Experiments on five long-context and multi-hop QA datasets demonstrate that our method outperforms strong RAG baselines and achieved higher answer accuracy. Ablation studies also indicate that our coarse-to-fine optimization schedule, the redundancy-aware reward shaping, along with the rationale generated by the policy, all contribute substantially to these gains.