{"ID":2862056,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.00820","arxiv_id":"2510.00820","title":"NSARM: Next-Scale Autoregressive Modeling for Robust Real-World Image Super-Resolution","abstract":"Most recent real-world image super-resolution (Real-ISR) methods employ pre-trained text-to-image (T2I) diffusion models to synthesize the high-quality image either from random Gaussian noise, which yields realistic results but is slow due to iterative denoising, or directly from the input low-quality image, which is efficient but at the price of lower output quality. These approaches train ControlNet or LoRA modules while keeping the pre-trained model fixed, which often introduces over-enhanced artifacts and hallucinations, suffering from the robustness to inputs of varying degradations. Recent visual autoregressive (AR) models, such as pre-trained Infinity, can provide strong T2I generation capabilities while offering superior efficiency by using the bitwise next-scale prediction strategy. Building upon next-scale prediction, we introduce a robust Real-ISR framework, namely Next-Scale Autoregressive Modeling (NSARM). Specifically, we train NSARM in two stages: a transformation network is first trained to map the input low-quality image to preliminary scales, followed by an end-to-end full-model fine-tuning. Such a comprehensive fine-tuning enhances the robustness of NSARM in Real-ISR tasks without compromising its generative capability. Extensive quantitative and qualitative evaluations demonstrate that as a pure AR model, NSARM achieves superior visual results over existing Real-ISR methods while maintaining a fast inference speed. Most importantly, it demonstrates much higher robustness to the quality of input images, showing stronger generalization performance. \nProject page: https://github.com/Xiangtaokong/NSARM","short_abstract":"Most recent real-world image super-resolution (Real-ISR) methods employ pre-trained text-to-image (T2I) diffusion models to synthesize the high-quality image either from random Gaussian noise, which yields realistic results but is slow due to iterative denoising, or directly from the input low-quality image, which is e...","url_abs":"https://arxiv.org/abs/2510.00820","url_pdf":"https://arxiv.org/pdf/2510.00820v1","authors":"[\"Xiangtao Kong\",\"Rongyuan Wu\",\"Shuaizheng Liu\",\"Lingchen Sun\",\"Lei Zhang\"]","published":"2025-10-01T12:29:58Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Diffusion Model\",\"LoRA\"]","has_code":false,"code_links":[{"ID":608868,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2862056,"paper_url":"https://arxiv.org/abs/2510.00820","paper_title":"NSARM: Next-Scale Autoregressive Modeling for Robust Real-World Image Super-Resolution","repo_url":"https://github.com/Xiangtaokong/NSARM","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}