{"ID":2844460,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.06516","arxiv_id":"2511.06516","title":"You Had One Job: Per-Task Quantization Using LLMs' Hidden Representations","abstract":"Many LLM applications require only narrow capabilities, yet standard post-training quantization (PTQ) methods allocate precision without considering the target task. This can waste bits on layers that are less relevant to the task signal while over-compressing layers that are critical for downstream behavior. We propose Task-Aware Quantization (TAQ), a training-free, weight-only mixed-precision PTQ framework that uses a small set of unlabeled task calibration prompts to allocate higher precision to task-relevant transformer layers under a fixed bit budget. TAQ estimates layer importance from hidden representations and output sensitivity, and we instantiate it with three scoring rules: TAQ-IS, based on activation information and stability; TAQ-KL, based on output-distribution sensitivity under a quantization-noise proxy; and TAQ-O, a label-informed oracle diagnostic for analyzing layer sensitivity. Across several benchmarks, TAQ outperforms task-agnostic baselines in most settings, with especially strong gains in the accuracy--memory ratio. We further validate that these gains translate to real deployment behavior through hardware throughput and latency measurements, and analyze calibration robustness and residual-stream error propagation. Overall, TAQ turns mixed-precision PTQ from a model-centric compression step into a task-conditioned precision-allocation problem. A reference implementation is available at https://anonymous.4open.science/r/TAQ-9217/README.md.","short_abstract":"Many LLM applications require only narrow capabilities, yet standard post-training quantization (PTQ) methods allocate precision without considering the target task. This can waste bits on layers that are less relevant to the task signal while over-compressing layers that are critical for downstream behavior. We propos...","url_abs":"https://arxiv.org/abs/2511.06516","url_pdf":"https://arxiv.org/pdf/2511.06516v3","authors":"[\"Amit LeVi\",\"Raz Lapid\",\"Rom Himelstein\",\"Chaim Baskin\",\"Ravid Shwartz Ziv\",\"Avi Mendelson\"]","published":"2025-11-09T19:58:24Z","proceeding":"cs.CL","tasks":"[\"cs.CL\"]","methods":"[\"Transformer\",\"Large Language Model\"]","project_urls":"[\"https://anonymous.4open.science/r/TAQ-9217/README.md\"]","has_code":false}
