{"ID":2883947,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.08446","arxiv_id":"2508.08446","title":"OverFill: Two-Stage Models for Efficient Language Model Decoding","abstract":"Large language models (LLMs) excel across diverse tasks but face significant deployment challenges due to high inference costs. LLM inference comprises prefill (compute-bound) and decode (memory-bound) stages, with decode dominating latency particularly for long sequences. Current decoder-only models handle both stages uniformly, despite their distinct computational profiles. We propose OverFill, which decouples these stages to optimize accuracy-efficiency tradeoffs. OverFill begins with a full model for prefill, processing system and user inputs in parallel. It then switches to a dense pruned model, while generating tokens sequentially. Leveraging more compute during prefill, OverFill improves generation quality with minimal latency overhead. Our 3B-to-1B OverFill configuration outperforms 1B pruned models by 83.2%, while the 8B-to-3B configuration improves over 3B pruned models by 79.2% on average across standard benchmarks. OverFill matches the performance of same-sized models trained from scratch, while using significantly less training data. Our code is available at https://github.com/friendshipkim/overfill.","short_abstract":"Large language models (LLMs) excel across diverse tasks but face significant deployment challenges due to high inference costs. LLM inference comprises prefill (compute-bound) and decode (memory-bound) stages, with decode dominating latency particularly for long sequences. Current decoder-only models handle both stages...","url_abs":"https://arxiv.org/abs/2508.08446","url_pdf":"https://arxiv.org/pdf/2508.08446v1","authors":"[\"Woojeong Kim\",\"Junxiong Wang\",\"Jing Nathan Yan\",\"Mohamed Abdelfattah\",\"Alexander M. Rush\"]","published":"2025-08-11T20:07:34Z","proceeding":"cs.AI","tasks":"[\"cs.AI\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":611036,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2883947,"paper_url":"https://arxiv.org/abs/2508.08446","paper_title":"OverFill: Two-Stage Models for Efficient Language Model Decoding","repo_url":"https://github.com/friendshipkim/overfill","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
