{"ID":2856099,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.13860","arxiv_id":"2510.13860","title":"ShishuLM : Achieving Optimal and Efficient Parameterization with Low Attention Transformer Models","abstract":"While the transformer architecture has achieved state-of-the-art performance on natural language processing tasks, these models impose substantial memory and computational overhead. Recent research has identified significant architectural redundancies within these models, particularly in the attention sub-layers in the top layers, presenting opportunities for optimization without compromising performance. Taking insights from research on inference-time layer pruning and depth-dependent computation in language models, we introduce an efficient language model architecture referred to as ShishuLM. By replacing full decoder layers at the top of the model with MLP-only blocks, we achieve up to 10-60% improvement in generation latency and 1.3 -5 $\\times$ gain in throughput. Upon further sharing parameters across adjacent MLP-only layers of ShishuLM, we obtain up to 20% savings in memory with minimal degradation in performance. Our findings provide insights towards building more efficient language modeling architectures from a pre-training standpoint by leveraging how information flows in transformers.","short_abstract":"While the transformer architecture has achieved state-of-the-art performance on natural language processing tasks, these models impose substantial memory and computational overhead. Recent research has identified significant architectural redundancies within these models, particularly in the attention sub-layers in the...","url_abs":"https://arxiv.org/abs/2510.13860","url_pdf":"https://arxiv.org/pdf/2510.13860v2","authors":"[\"Shivanshu Kumar\",\"Gopalakrishnan Srinivasan\"]","published":"2025-10-13T04:04:54Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.AI\"]","methods":"[\"Transformer\",\"Language Model\"]","has_code":false}