{"ID":3138320,"CreatedAt":"2026-06-05T23:42:33.029511562Z","UpdatedAt":"2026-06-06T08:42:33.101913816Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2604.18739","arxiv_id":"2604.18739","title":"Discrete Tilt Matching","abstract":"Masked diffusion large language models (dLLMs) are a promising alternative to autoregressive generation. While reinforcement learning (RL) methods have recently been adapted to dLLM fine-tuning, their objectives typically depend on sequence-level marginal likelihoods, which are intractable for masked diffusion models. To address this, we derive Discrete Tilt Matching (DTM), a likelihood-free method that recasts dLLM fine-tuning as state-level matching of local unmasking posteriors under reward tilting. DTM takes the form of a weighted cross-entropy objective with explicit minimizer, and admits control variates that improve training stability. On a synthetic maze-planning task, we analyze how DTM's annealing schedule and control variates affect training stability and prevent mode collapse. At scale, fine-tuning LLaDA-8B-Instruct with DTM yields strong gains on Sudoku and Countdown while remaining competitive on MATH500 and GSM8K.","short_abstract":"Masked diffusion large language models (dLLMs) are a promising alternative to autoregressive generation. While reinforcement learning (RL) methods have recently been adapted to dLLM fine-tuning, their objectives typically depend on sequence-level marginal likelihoods, which are intractable for masked diffusion models....","url_abs":"https://arxiv.org/abs/2604.18739","url_pdf":"https://arxiv.org/pdf/2604.18739v3","authors":"[\"Yuyuan Chen\",\"Shiyi Wang\",\"Peter Potaptchik\",\"Jaeyeon Kim\",\"Michael S. Albergo\"]","published":"2026-04-20T18:43:37Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"stat.ML\"]","methods":"[\"Reinforcement Learning\",\"Diffusion Model\",\"Large Language Model\",\"Language Model\"]","has_code":false}
