{"ID":2922079,"CreatedAt":"2026-06-02T02:42:49.606572591Z","UpdatedAt":"2026-06-02T12:17:29.810469831Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.00703","arxiv_id":"2606.00703","title":"Information-Theoretic Lower Bounds for Bit-Constrained Stochastic Optimization via a Reduction to Compressed Gaussian Mean Estimation","abstract":"Low-precision pretraining (FP8, MXFP4, NVFP4) is now standard for frontier language models, yet the literature is almost entirely achievability -- algorithms and empirical scaling laws -- with no matching characterization of what is information-theoretically possible. We study a B-bit quantized stochastic first-order oracle: an optimizer interacts for T rounds and receives, each round, a B-bit adaptive public-coin description of its stochastic gradient. Our main contribution is an exact reduction from optimizing a strongly convex quadratic family to interactively compressed Gaussian mean estimation -- under the B-bit oracle the query carries no information, so optimization collapses exactly onto a sequential distributed-estimation problem. This yields two unconditional lower bounds, a communication bound TB = Omega(d) and a statistical bound T = Omega(sigma^2 d / eps^2), and the sharp product-form bound T = Omega((sigma^2 d / eps^2) max{1, d/B}). The product form is also unconditional: a B-bit transcript carries at most O(TB / sigma^2) of Fisher trace about the mean, so bits rather than dimension limit the recoverable information, and combined with the multivariate van Trees inequality this gives the bound directly, without bounded-likelihood-ratio truncation. We give a near-matching achievability result with exact per-round bit accounting under a bounded-dynamic-range oracle, tight up to a logarithmic factor; the lower bound is for truly Gaussian (unbounded) gradients, and closing this oracle gap is left open. A sequential rate-distortion perspective extends the reduction to correlated and drifting oracles and corrects an earlier conjecture: positive noise correlation raises the bound by (1+rho)/(1-rho) rather than relaxing it. The bounds give an information-theoretic baseline for any low-bit gradient path, not an optimality claim about deployed FP4 systems.","short_abstract":"Low-precision pretraining (FP8, MXFP4, NVFP4) is now standard for frontier language models, yet the literature is almost entirely achievability -- algorithms and empirical scaling laws -- with no matching characterization of what is information-theoretically possible. We study a B-bit quantized stochastic first-order o...","url_abs":"https://arxiv.org/abs/2606.00703","url_pdf":"https://arxiv.org/pdf/2606.00703v1","authors":"[\"Munsik Kim\"]","published":"2026-05-30T12:22:20Z","proceeding":"cs.IT","tasks":"[\"cs.IT\",\"cs.AI\",\"cs.LG\"]","methods":"[\"Language Model\"]","has_code":false}
