{"ID":2878939,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.17400","arxiv_id":"2508.17400","title":"Retrieval Capabilities of Large Language Models Scale with Pretraining FLOPs","abstract":"How does retrieval performance scale with pretraining FLOPs? We benchmark retrieval performance across LLM model sizes from 125 million parameters to 7 billion parameters pretrained on datasets ranging from 1 billion tokens to more than 2 trillion tokens. We find that retrieval performance on zero-shot BEIR tasks predictably scales with LLM size, training duration, and estimated FLOPs. We also show that In-Context Learning scores are strongly correlated with retrieval scores across retrieval tasks. Finally, we highlight the implications this has for the development of LLM-based retrievers.","short_abstract":"How does retrieval performance scale with pretraining FLOPs? We benchmark retrieval performance across LLM model sizes from 125 million parameters to 7 billion parameters pretrained on datasets ranging from 1 billion tokens to more than 2 trillion tokens. We find that retrieval performance on zero-shot BEIR tasks predi...","url_abs":"https://arxiv.org/abs/2508.17400","url_pdf":"https://arxiv.org/pdf/2508.17400v1","authors":"[\"Jacob Portes\",\"Connor Jennings\",\"Erica Ji Yuen\",\"Sasha Doubov\",\"Michael Carbin\"]","published":"2025-08-24T15:19:24Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.AI\",\"cs.IR\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}