{"ID":2880430,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.15099","arxiv_id":"2508.15099","title":"Hydra: A Modular Architecture for Efficient Long-Context Reasoning","abstract":"The quadratic complexity of transformers fundamentally limits reasoning system deployment in resource-constrained and long-context settings. We introduce Hydra, a modular architecture based upon a state-space backbone which adaptively routes between complementary efficiency mechanisms: sparse global attention, mixture-of-experts, and dual memories comprising a reasoning workspace and product key memory. We evaluate a 29M parameter model measuring logical chaining accuracy and throughput on synthetic sequences, plus throughput on WikiText. Ablation studies use component-specific synthetic datasets to isolate individual mechanisms. Hydra achieves $3.01\\times$ and $3.0\\times$ throughput gains at 8K tokens for synthetic and WikiText datasets, respectively, and $10\\times$ accuracy improvements on multi-step logical composition compared to equal-sized transformers. Ablations confirm each component's contribution: sparse attention captures long-range dependencies, experts specialize to input domains, and product key memory enables selective retrieval.","short_abstract":"The quadratic complexity of transformers fundamentally limits reasoning system deployment in resource-constrained and long-context settings. We introduce Hydra, a modular architecture based upon a state-space backbone which adaptively routes between complementary efficiency mechanisms: sparse global attention, mixture-...","url_abs":"https://arxiv.org/abs/2508.15099","url_pdf":"https://arxiv.org/pdf/2508.15099v3","authors":"[\"Siddharth Chaudhary\",\"Dev Patel\",\"Maheep Chaudhary\",\"Bennett Browning\"]","published":"2025-08-20T22:31:15Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.AI\",\"stat.ML\"]","methods":"[\"Transformer\"]","has_code":false}