{"ID":2855343,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.13797","arxiv_id":"2510.13797","title":"Breadcrumbs Reasoning: Memory-Efficient Reasoning with Compression Beacons","abstract":"The scalability of large language models for long-context reasoning is severely constrained by the linear growth of their Transformer key-value cache, which incurs significant memory and computational costs. We posit that as a model generates reasoning tokens, the informational value of past generated tokens diminishes, creating an opportunity for compression. In this work, we propose to periodically compress the generation KV cache with a learned, special-purpose token and evict compressed entries. We train the model to perform this compression via a modified joint distillation and reinforcement learning (RL) framework. Our training method minimizes overhead over the conventional RL process, as it leverages RL outputs for distillation. Empirically, our method achieves a superior memory-accuracy Pareto frontier compared to both the model without cache compression and training-free compression techniques.","short_abstract":"The scalability of large language models for long-context reasoning is severely constrained by the linear growth of their Transformer key-value cache, which incurs significant memory and computational costs. We posit that as a model generates reasoning tokens, the informational value of past generated tokens diminishes...","url_abs":"https://arxiv.org/abs/2510.13797","url_pdf":"https://arxiv.org/pdf/2510.13797v3","authors":"[\"Giovanni Monea\",\"Yair Feldman\",\"Shankar Padmanabhan\",\"Kianté Brantley\",\"Yoav Artzi\"]","published":"2025-10-15T17:57:21Z","proceeding":"cs.CL","tasks":"[\"cs.CL\"]","methods":"[\"Reinforcement Learning\",\"Transformer\",\"Language Model\"]","has_code":false}
