{"ID":2922103,"CreatedAt":"2026-06-02T02:42:49.606572591Z","UpdatedAt":"2026-06-02T13:54:14.569670787Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.00739","arxiv_id":"2606.00739","title":"Score $\\times$ Decoder: A Unified View of Unsupervised Inference-Time Scaling for Hallucination Mitigation","abstract":"Large language models hallucinate even when the answer lies within their parameters. While inference-time scaling can surface this latent knowledge, the most effective methods require supervision: a trained verifier or reward model. We ask what can be done with only a base language model: which intrinsic signal best identifies correct outputs, and how should it be decoded? We cast this as a score~$\\times$~decoder grid pairing four scores (perplexity, contrastive, power-distribution likelihood, and self-verification) with three decoding families (optimization, sampling, consensus), and evaluate every cell on MATH500 with the base and instruction-tuned Qwen3-1.7B. While self-verification, which prompts the model to judge its own answer and is sharpened by a training-free virtual-thinking prefix, works well in most settings, no score has a fixed quality: its value depends on the decoder that consumes it and on model capability. When no supervision is available, the score and the decoding family must be chosen together.","short_abstract":"Large language models hallucinate even when the answer lies within their parameters. While inference-time scaling can surface this latent knowledge, the most effective methods require supervision: a trained verifier or reward model. We ask what can be done with only a base language model: which intrinsic signal best id...","url_abs":"https://arxiv.org/abs/2606.00739","url_pdf":"https://arxiv.org/pdf/2606.00739v1","authors":"[\"Yun-Chen Cheng\",\"Che-Yu Lin\",\"Cheng-Lin Yang\"]","published":"2026-05-30T14:13:52Z","proceeding":"cs.LG","tasks":"[\"cs.LG\"]","methods":"[\"Language Model\"]","has_code":false}
