{"ID":2832346,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.05325","arxiv_id":"2512.05325","title":"LYNX: Learning Dynamic Exits for Confidence-Controlled Reasoning","abstract":"Large reasoning models achieve strong performance on complex tasks by generating extended chains of thought, but they often \"overthink\": continuing to reason long after they have enough information to answer correctly. This wastes inference-time compute and can hurt accuracy. Existing attempts to stop early either manipulate decoding with extra sampling and heuristics, rely on auxiliary verifier models, or operate only as post-hoc analysis pipelines without formal guarantees. We introduce LYNX, an online early-exit mechanism that turns a model's own hidden-state awareness into confidence-controlled stopping decisions. LYNX attaches exit decisions to naturally occurring reasoning cues (e.g., \"hmm\", \"wait\") during generation, trains a lightweight probe on hidden states at those cue tokens using supervision from forced exits, and wraps the resulting scores in split conformal prediction to obtain distribution-free control over premature exits. Crucially, we train and calibrate this probe once on a generic mathematical corpus and reuse it unchanged across benchmarks, decoding temperatures, and even non-mathematical tasks. Across three model families spanning 1.5B to 32B parameters, a single mathematically trained probe per base model yields strong accuracy--efficiency tradeoffs. On GSM8K, LYNX matches or improves baseline accuracy while reducing tokens by 40--65\\%; on MATH-500 it improves accuracy by up to 12 points with roughly 35--60\\% fewer tokens; on AIME 2024 it recovers baseline accuracy with more than 50\\% token savings; and on CommonsenseQA, a non-math benchmark, it transfers zero-shot with modest accuracy gains and up to 70\\% fewer tokens. Compared to state-of-the-art early-exit methods, LYNX offers competitive or superior Pareto frontiers while remaining fully online, requiring no proxy models at inference, and providing explicit, user-tunable confidence guarantees.","short_abstract":"Large reasoning models achieve strong performance on complex tasks by generating extended chains of thought, but they often \"overthink\": continuing to reason long after they have enough information to answer correctly. This wastes inference-time compute and can hurt accuracy. Existing attempts to stop early either mani...","url_abs":"https://arxiv.org/abs/2512.05325","url_pdf":"https://arxiv.org/pdf/2512.05325v1","authors":"[\"Ömer Faruk Akgül\",\"Yusuf Hakan Kalaycı\",\"Rajgopal Kannan\",\"Willie Neiswanger\",\"Viktor Prasanna\"]","published":"2025-12-05T00:04:42Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.AI\",\"cs.LG\"]","methods":"[]","has_code":false}