{"ID":2848465,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.25278","arxiv_id":"2510.25278","title":"DIRC-RAG: Accelerating Edge RAG with Robust High-Density and High-Loading-Bandwidth Digital In-ReRAM Computation","abstract":"Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by integrating external knowledge retrieval but faces challenges on edge devices due to high storage, energy, and latency demands. Computing-in-Memory (CIM) offers a promising solution by storing document embeddings in CIM macros and enabling in-situ parallel retrievals but is constrained by either low memory density or limited computational accuracy. To address these challenges, we present DIRC-RAG, a novel edge RAG acceleration architecture leveraging Digital In-ReRAM Computation (DIRC). DIRC integrates a high-density multi-level ReRAM subarray with an SRAM cell, utilizing SRAM and differential sensing for robust ReRAM readout and digital multiply-accumulate (MAC) operations. By storing all document embeddings within the CIM macro, DIRC achieves ultra-low-power, single-cycle data loading, substantially reducing both energy consumption and latency compared to off-chip DRAM. A query-stationary (QS) dataflow is supported for RAG tasks, minimizing on-chip data movement and reducing SRAM buffer requirements. We introduce error optimization for the DIRC ReRAM-SRAM cell by extracting the bit-wise spatial error distribution of the ReRAM subarray and applying targeted bit-wise data remapping. An error detection circuit is also implemented to enhance readout resilience against device- and circuit-level variations. Simulation results demonstrate that DIRC-RAG under TSMC 40nm process achieves an on-chip non-volatile memory density of 5.18Mb/mm² and a throughput of 131 TOPS. 
It delivers a 4MB retrieval latency of 5.6μs/query and an energy consumption of 0.956μJ/query, while maintaining the retrieval precision.","short_abstract":"Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by integrating external knowledge retrieval but faces challenges on edge devices due to high storage, energy, and latency demands. Computing-in-Memory (CIM) offers a promising solution by storing document embeddings in CIM macros and enabling in...","url_abs":"https://arxiv.org/abs/2510.25278","url_pdf":"https://arxiv.org/pdf/2510.25278v1","authors":"[\"Kunming Shao\",\"Zhipeng Liao\",\"Jiangnan Yu\",\"Liang Zhao\",\"Qiwei Li\",\"Xijie Huang\",\"Jingyu He\",\"Fengshi Tian\",\"Yi Zou\",\"Xiaomeng Wang\",\"Tim Kwang-Ting Cheng\",\"Chi-Ying Tsui\"]","published":"2025-10-29T08:38:02Z","proceeding":"cs.AR","tasks":"[\"cs.AR\"]","methods":"[\"RAG\",\"Large Language Model\",\"Language Model\"]","has_code":false}
