{"ID":2824715,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.23087","arxiv_id":"2512.23087","title":"Dynamic Vocabulary Pruning: Stable LLM-RL by Taming the Tail","abstract":"Reinforcement Learning (RL) for Large Language Models (LLMs) faces a fundamental tension: the numerical divergence between high-throughput inference engines and numerically precise training engines. Although these systems share the same parameters, they produce slightly different probability distributions, creating a training-inference mismatch. We prove that the bound on the log-probability divergence arising from this mismatch scales as $(1-p)$, where $p$ is the token probability. This scaling induces a highly asymmetric effect: the bound vanishes for high-probability tokens but remains significant for low-probability tokens in the distribution tail. When sampled, these tail tokens introduce systematically biased errors that accumulate over sequences, thereby destabilizing gradient estimation. Instead of applying post-hoc corrections, we propose Dynamic Vocabulary Pruning (DVP), which constrains the RL objective to a dynamically determined ''safe'' vocabulary that excludes the extreme tail. This strategy trades large, destabilizing numerical errors for a small, bounded optimization bias. We validate DVP empirically by demonstrating stable training, and theoretically by deriving strict bounds on the induced bias.","short_abstract":"Reinforcement Learning (RL) for Large Language Models (LLMs) faces a fundamental tension: the numerical divergence between high-throughput inference engines and numerically precise training engines. Although these systems share the same parameters, they produce slightly different probability distributions, creating a t...","url_abs":"https://arxiv.org/abs/2512.23087","url_pdf":"https://arxiv.org/pdf/2512.23087v2","authors":"[\"Yingru Li\",\"Jiawei Xu\",\"Jiacai Liu\",\"Yuxuan Tong\",\"Ziniu Li\",\"Tianle Cai\",\"Ge Zhang\",\"Qian Liu\",\"Baoxiang Wang\"]","published":"2025-12-28T21:44:07Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.AI\",\"stat.ML\"]","methods":"[\"Reinforcement Learning\",\"Large Language Model\",\"Language Model\"]","has_code":false}
