{"ID":2870003,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.14234","arxiv_id":"2509.14234","title":"Compute as Teacher: Turning Inference Compute Into Reference-Free Supervision","abstract":"Where do learning signals come from when there is no ground truth in post-training? We show that inference compute itself can serve as supervision. By generating parallel rollouts and converting them into reference estimates, models can learn without human labels-critically, even in non-verifiable domains like healthcare guidance where no programmatic checker exists. We call this framework Compute as Teacher (CaT) and it turns inference-time compute from parallel rollouts into supervision for RL training. The framework has two components: (1) reference estimation which aggregates rollouts into a pseudo-reference answer, and (2) reward derivation which converts that pseudo-reference into RL rewards. For (1), we explore a simple method we call synthesis, but the framework admits any aggregator. For (2), we introduce self-proposed rubrics for non-verifiable domains. These are binary, auditable criteria generated from the pseudo-reference and scored by an LLM judge. On HealthBench, models trained with CaT match or exceed inference-time aggregation quality while using 9x less test-time compute. Here, CaT also competes with learning from expert physician annotations, yielding up to +30% relative improvement over the initial policy. The framework extends naturally to verifiable rewards, matching the best existing baselines on MATH-500 in test-time RL and demonstrating 'drop-in' versatility across both types of domains.","short_abstract":"Where do learning signals come from when there is no ground truth in post-training? We show that inference compute itself can serve as supervision. By generating parallel rollouts and converting them into reference estimates, models can learn without human labels-critically, even in non-verifiable domains like healthca...","url_abs":"https://arxiv.org/abs/2509.14234","url_pdf":"https://arxiv.org/pdf/2509.14234v3","authors":"[\"Dulhan Jayalath\",\"Shashwat Goel\",\"Thomas Foster\",\"Parag Jain\",\"Suchin Gururangan\",\"Cheng Zhang\",\"Anirudh Goyal\",\"Alan Schelten\"]","published":"2025-09-17T17:59:42Z","proceeding":"cs.LG","tasks":"[\"cs.LG\"]","methods":"[\"Large Language Model\"]","has_code":false}
