{"ID":2840367,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.12987","arxiv_id":"2511.12987","title":"Reuse, Don't Recompute: Efficient Large Reasoning Model Inference via Memory Orchestration","abstract":"Large reasoning models (LRMs) achieve strong accuracy through test-time scaling, generating longer chains of thought or sampling multiple solutions, but at steep costs in tokens and latency. We argue that memory is a core ingredient for efficient reasoning: when evidence already exists, models should think less by reusing structured memory instead of recomputing derivations. We present ENGRAM-R, an inference-time memory layer that integrates typed retrieval with compact fact card representations and explicit citation control. On the LoCoMo benchmark, ENGRAM-R reduces input tokens by 85% and reasoning tokens by 75% compared to full context while maintaining high accuracy. On a multi-hop slice of the LongMemEval benchmark, it achieves similar efficiency with substantial accuracy gains. These results show that memory is not only critical for long-horizon correctness but also a practical lever for efficient reasoning under tight compute, memory, and latency budgets.","short_abstract":"Large reasoning models (LRMs) achieve strong accuracy through test-time scaling, generating longer chains of thought or sampling multiple solutions, but at steep costs in tokens and latency. We argue that memory is a core ingredient for efficient reasoning: when evidence already exists, models should think less by reus...","url_abs":"https://arxiv.org/abs/2511.12987","url_pdf":"https://arxiv.org/pdf/2511.12987v3","authors":"[\"Daivik Patel\",\"Shrenik Patel\"]","published":"2025-11-17T05:16:25Z","proceeding":"cs.MA","tasks":"[\"cs.MA\"]","methods":"[]","has_code":false}
