{"ID":2833552,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.03870","arxiv_id":"2512.03870","title":"Reconstructing KV Caches with Cross-layer Fusion For Enhanced Transformers","abstract":"Transformer decoders have achieved strong results across tasks, but the memory required for the KV cache becomes prohibitive at long sequence lengths. Although Cross-layer KV Cache sharing (e.g., YOCO, CLA) offers a path to mitigate KV Cache bottleneck, it typically underperforms within-layer methods like GQA. To understand the root cause, we investigate the information flow of keys and values of the top-layers. Our preliminary reveals a clear distribution: values are predominantly derived from the bottom layer, while keys draw more information from both bottom and middle layers. Building upon this, we propose FusedKV, whose top-layer KV caches are a learnable fusion of the most informative ones from the bottom and middle layers. This fusion operates directly on post-RoPE keys, preserving relative positional information without the computational cost of re-applying rotary embeddings. To further improve efficiency, we propose FusedKV-Lite, an cross-layer sharing approach, where top-layer KV caches are directly derived from the bottom-layer values and the middle-layer keys. Compared to FusedKV, FusedKV-Lite reduces I/O overhead at the cost of a slight increase in perplexity. In experiments on LLMs ranging from 332M to 4B parameters, our proposed method reduce 50\\% cache memory while achieving lower validation perplexity than the standard Transformer decoder, establishing it as a memory-efficient, high-performance architectural alternative.","short_abstract":"Transformer decoders have achieved strong results across tasks, but the memory required for the KV cache becomes prohibitive at long sequence lengths. Although Cross-layer KV Cache sharing (e.g., YOCO, CLA) offers a path to mitigate KV Cache bottleneck, it typically underperforms within-layer methods like GQA. To under...","url_abs":"https://arxiv.org/abs/2512.03870","url_pdf":"https://arxiv.org/pdf/2512.03870v3","authors":"[\"Hongzhan Lin\",\"Zhiqi Bai\",\"Xinmiao Zhang\",\"Sen Yang\",\"Xiang Li\",\"Siran Yang\",\"Yunlong Xu\",\"Jiaheng Liu\",\"Yongchi Zhao\",\"Jiamang Wang\",\"Yuchi Xu\",\"Wenbo Su\",\"Bo Zheng\"]","published":"2025-12-03T15:22:00Z","proceeding":"cs.CL","tasks":"[\"cs.CL\"]","methods":"[\"Transformer\",\"Large Language Model\"]","has_code":false}
