{"ID":2856891,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.10787","arxiv_id":"2510.10787","title":"Review of Inference-Time Scaling Strategies: Reasoning, Search and RAG","abstract":"The performance gains of LLMs have historically been driven by scaling up model size and training data. However, the rapidly diminishing availability of high-quality training data is introducing a fundamental bottleneck, shifting the focus of research toward inference-time scaling. This paradigm uses additional computation at the time of deployment to substantially improve LLM performance on downstream tasks without costly model re-training. This review systematically surveys the diverse techniques contributing to this new era of inference-time scaling, organizing the rapidly evolving field into two comprehensive perspectives: Output-focused and Input-focused methods. Output-focused techniques encompass complex, multi-step generation strategies, including reasoning (e.g., CoT, ToT, ReAct), various search and decoding methods (e.g., MCTS, beam search), training for long CoT (e.g., RLVR, GRPO), and model ensemble methods. Input-focused techniques are primarily categorized by few-shot and RAG, with RAG as the central focus. The RAG section is further detailed through a structured examination of query expansion, data, retrieval and reranker, LLM generation methods, and multi-modal RAG.","short_abstract":"The performance gains of LLMs have historically been driven by scaling up model size and training data. However, the rapidly diminishing availability of high-quality training data is introducing a fundamental bottleneck, shifting the focus of research toward inference-time scaling. This paradigm uses additional computa...","url_abs":"https://arxiv.org/abs/2510.10787","url_pdf":"https://arxiv.org/pdf/2510.10787v1","authors":"[\"Zhichao Wang\",\"Cheng Wan\",\"Dong Nie\"]","published":"2025-10-12T20:09:07Z","proceeding":"cs.CL","tasks":"[\"cs.CL\"]","methods":"[\"Large Language Model\",\"Generative Adversarial Network\"]","has_code":false}