{"ID":2858447,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.08798","arxiv_id":"2510.08798","title":"Learning What to Remember: Adaptive Probabilistic Memory Retention for Memory-Efficient Language Models","abstract":"Transformer attention scales quadratically with sequence length O(n^2), limiting long-context use. We propose Adaptive Retention, a probabilistic, layer-wise token selection mechanism that learns which representations to keep under a strict global budget M. Retention is modeled with Bernoulli gates trained via a Hard-Concrete/variational relaxation and enforced with a simple top-M rule at inference, making the method differentiable and drop-in for standard encoders. Across classification, extractive QA, and long-document summarization, keeping only 30-50% of tokens preserves \u003e= 95% of full-model performance while cutting peak memory by ~35-45% and improving throughput by up to ~1.8x. This architecture-agnostic approach delivers practical long-context efficiency without modifying base attention or task heads.","short_abstract":"Transformer attention scales quadratically with sequence length O(n^2), limiting long-context use. We propose Adaptive Retention, a probabilistic, layer-wise token selection mechanism that learns which representations to keep under a strict global budget M. Retention is modeled with Bernoulli gates trained via a Hard-C...","url_abs":"https://arxiv.org/abs/2510.08798","url_pdf":"https://arxiv.org/pdf/2510.08798v1","authors":"[\"S M Rafiuddin\",\"Muntaha Nujat Khan\"]","published":"2025-10-09T20:26:47Z","proceeding":"cs.CL","tasks":"[\"cs.CL\"]","methods":"[\"Transformer\",\"Language Model\"]","has_code":false}
