{"ID":2841000,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.21702","arxiv_id":"2511.21702","title":"CSV-Decode: Certifiable Sub-Vocabulary Decoding for Efficient Large Language Model Inference","abstract":"Large language models face significant computational bottlenecks during inference due to the expensive output layer computation over large vocabularies. We present CSV-Decode, a novel approach that uses geometric upper bounds to construct small sub-vocabularies for each decoding step, enabling efficient sparse computation while maintaining dual correctness guarantees: exact top-$k$ certification and $\\varepsilon$-certified softmax approximations. Our method clusters vocabulary embeddings offline and uses centroid-plus-radius bounds to identify which tokens can be safely omitted from computation. We provide a complete system implementation with sparse GEMV kernels, multi-GPU sharding, and CUDA Graph optimization. Experimental results demonstrate significant speedup over full vocabulary decoding while maintaining distributional guarantees and low fallback rates. Our code implementation is available at https://github.com/FastLM/CSV-Decode.","short_abstract":"Large language models face significant computational bottlenecks during inference due to the expensive output layer computation over large vocabularies. We present CSV-Decode, a novel approach that uses geometric upper bounds to construct small sub-vocabularies for each decoding step, enabling efficient sparse computat...","url_abs":"https://arxiv.org/abs/2511.21702","url_pdf":"https://arxiv.org/pdf/2511.21702v1","authors":"[\"Dong Liu\",\"Yanxuan Yu\",\"Ben Lengerich\"]","published":"2025-11-16T14:02:41Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.AI\"]","methods":"[\"Language Model\"]","has_code":false,"code_links":[{"ID":607020,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2841000,"paper_url":"https://arxiv.org/abs/2511.21702","paper_title":"CSV-Decode: Certifiable Sub-Vocabulary Decoding for Efficient Large Language Model Inference","repo_url":"https://github.com/FastLM/CSV-Decode","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}