{"ID":2885894,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.04612","arxiv_id":"2508.04612","title":"A Reproducible, Scalable Pipeline for Synthesizing Autoregressive Model Literature","abstract":"The accelerating pace of research on autoregressive generative models has produced thousands of papers, making manual literature surveys and reproduction studies increasingly impractical. We present a fully open-source, reproducible pipeline that automatically retrieves candidate documents from public repositories, filters them for relevance, extracts metadata, hyper-parameters and reported results, clusters topics, produces retrieval-augmented summaries and generates containerised scripts for re-running selected experiments. Quantitative evaluation on 50 manually-annotated papers shows F1 scores above 0.85 for relevance classification, hyper-parameter extraction and citation identification. Experiments on corpora of up to 1000 papers demonstrate near-linear scalability with eight CPU workers. Three case studies -- AWD-LSTM on WikiText-2, Transformer-XL on WikiText-103 and an autoregressive music model on the Lakh MIDI dataset -- confirm that the extracted settings support faithful reproduction, achieving test perplexities within 1--3% of the original reports.","short_abstract":"The accelerating pace of research on autoregressive generative models has produced thousands of papers, making manual literature surveys and reproduction studies increasingly impractical. We present a fully open-source, reproducible pipeline that automatically retrieves candidate documents from public repositories, fil...","url_abs":"https://arxiv.org/abs/2508.04612","url_pdf":"https://arxiv.org/pdf/2508.04612v1","authors":"[\"Faruk Alpay\",\"Bugra Kilictas\",\"Hamdi Alakkad\"]","published":"2025-08-06T16:33:20Z","proceeding":"cs.IR","tasks":"[\"cs.IR\",\"cs.DL\",\"cs.LG\"]","methods":"[\"Transformer\"]","has_code":false}
