{"ID":2828790,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.12990","arxiv_id":"2512.12990","title":"SliceMoE: Bit-Sliced Expert Caching under Miss-Rate Constraints for Efficient MoE Inference","abstract":"MoE models offer efficient scaling through conditional computation, but their large parameter size and expensive expert offloading make on-device deployment challenging. Existing acceleration techniques such as prefetching or expert clustering often increase energy usage or reduce expert diversity. We present SliceMoE, an energy-efficient MoE inference framework for miss-rate-constrained deployment. SliceMoE introduces Dynamic Bit-Sliced Caching (DBSC), which caches experts at slice-level granularity and assigns precision on demand to expand effective expert capacity. To support mixed-precision experts without memory duplication, we propose Calibration-Free Asymmetric Matryoshka Quantization (AMAT), a truncation-based scheme that maintains compatibility between low-bit and high-bit slices. We further introduce Predictive Cache Warmup (PCW) to reduce early-decode cold misses by reshaping cache contents during prefill. Evaluated on DeepSeek-V2-Lite and Qwen1.5-MoE-A2.7B, SliceMoE reduces decode-stage energy consumption by up to 2.37x and 2.85x, respectively, and improves decode latency by up to 1.81x and 1.64x, while preserving near-high-bit accuracy. These results demonstrate that slice-level caching enables an efficient on-device MoE deployment.","short_abstract":"MoE models offer efficient scaling through conditional computation, but their large parameter size and expensive expert offloading make on-device deployment challenging. Existing acceleration techniques such as prefetching or expert clustering often increase energy usage or reduce expert diversity. We present SliceMoE,...","url_abs":"https://arxiv.org/abs/2512.12990","url_pdf":"https://arxiv.org/pdf/2512.12990v4","authors":"[\"Yuseon Choi\",\"Sangjin Kim\",\"Jungjun Oh\",\"Gwangtae Park\",\"Byeongcheol Kim\",\"Hoi-Jun Yoo\"]","published":"2025-12-15T05:33:07Z","proceeding":"cs.AR","tasks":"[\"cs.AR\"]","methods":"[]","has_code":false}