TRACE: Unlocking Effective CXL Bandwidth via Lossless Compression and Precision Scaling

cs.AR arXiv:2509.03377
View PDF arXiv JSON

Abstract

LLM inference is increasingly limited by memory bandwidth, and the bottleneck worsens at long context as the KV cache grows. CXL memory adds capacity to offload weights and KV, but its link and device-side DDR bandwidth are far below HBM, so decoding stalls once traffic shifts to the CXL tier. Many CXL controllers are starting to add generic \emph{lossless} compression, yet applying commodity codecs directly to standard word-major LLM tensors is largely ineffective, especially for token-major KV streams. We propose TRACE (\textbf{T}raffic-\textbf{R}educed \textbf{A}rchitecture for \textbf{C}ompression and \textbf{E}lasticity), which preserves the unmodified CXL.mem interface but changes the device-internal representation. It stores tensors in a channel-major, disaggregated bit-plane layout, and applies a KV-specific transform before compression, converting mixed-field words into low-entropy plane streams that commodity codecs can compress. The same substrate enables precision-proportional fetch by reading only the required bit-planes. Across public LLMs, TRACE reduces BF16 weight footprint by 25.2\% and BF16 KV footprint by 46.9\% losslessly, with per-layer KV ratios peaking at 2.69$\times$. In trace-driven system modeling, once KV spills to CXL, GPT-OSS-120B-MXFP4 improves throughput at 128k tokens from 16.28 to 68.99 tok/s (4.24$\times$). DRAMSim3 shows up to 40.3\% lower DRAM access energy under plane-aligned fetch. A 7\,nm SystemVerilog implementation sustains 256\,GB/s device bandwidth. Relative to a CXL controller with generic inline lossless compression, TRACE only adds 7.2\% area, 4.7\% power, and 6.0\% load-to-use latency at 2\,GHz and 0.7\,V.

PDF Viewer