{"ID":2826402,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.20687","arxiv_id":"2512.20687","title":"PHOTON: Hierarchical Autoregressive Modeling for Lightspeed and Memory-Efficient Language Generation","abstract":"Transformers operate as horizontal token-by-token scanners; at each generation step, attending to an ever-growing sequence of token-level states. This access pattern increases prefill latency and makes long-context decoding more memory-bound, as KV-cache reads and writes dominate inference time over arithmetic operations. We propose Parallel Hierarchical Operation for TOp-down Networks (PHOTON), a hierarchical autoregressive model that replaces horizontal scanning with vertical, multi-resolution context scanning. PHOTON maintains a hierarchy of latent streams: a bottom-up encoder compresses tokens into low-rate contextual states, while lightweight top-down decoders reconstruct fine-grained token representations in parallel. We further introduce recursive generation that updates only the coarsest latent stream and eliminates bottom-up re-encoding. Experimental results show that PHOTON is superior to competitive Transformer-based language models regarding the throughput-quality trade-off, providing advantages in long-context and multi-query tasks. In particular, this reduces decode-time KV-cache traffic, yielding up to $10^{3}\\times$ higher throughput per unit memory.","short_abstract":"Transformers operate as horizontal token-by-token scanners; at each generation step, attending to an ever-growing sequence of token-level states. \nThis access pattern increases prefill latency and makes long-context decoding more memory-bound, as KV-cache reads and writes dominate inference time over arithmetic operatio...","url_abs":"https://arxiv.org/abs/2512.20687","url_pdf":"https://arxiv.org/pdf/2512.20687v2","authors":"[\"Yuma Ichikawa\",\"Naoya Takagi\",\"Takumi Nakagawa\",\"Yuzi Kanazawa\",\"Akira Sakai\"]","published":"2025-12-22T19:26:59Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.AI\",\"cs.CL\",\"cs.DC\"]","methods":"[\"Transformer\",\"Language Model\"]","has_code":false}
