{"ID":2830403,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.10932","arxiv_id":"2512.10932","title":"BabyVLM-V2: Toward Developmentally Grounded Pretraining and Benchmarking of Vision Foundation Models","abstract":"Early children's developmental trajectories set up a natural goal for sample-efficient pretraining of vision foundation models. We introduce BabyVLM-V2, a developmentally grounded framework for infant-inspired vision-language modeling that extensively improves upon BabyVLM-V1 through a longitudinal, multifaceted pretraining set, a versatile model, and, most importantly, DevCV Toolbox for cognitive evaluation. The pretraining set maximizes coverage while minimizing curation of a longitudinal, infant-centric audiovisual corpus, yielding video-utterance, image-utterance, and multi-turn conversational data that mirror infant experiences. DevCV Toolbox adapts all vision-related measures of the recently released NIH Baby Toolbox into a benchmark suite of ten multimodal tasks, covering spatial reasoning, memory, and vocabulary understanding aligned with early children's capabilities. Experimental results show that a compact model pretrained from scratch can achieve competitive performance on DevCV Toolbox, outperforming GPT-4o on some tasks. We hope the principled, unified BabyVLM-V2 framework will accelerate research in developmentally plausible pretraining of vision foundation models.","short_abstract":"Early children's developmental trajectories set up a natural goal for sample-efficient pretraining of vision foundation models. \nWe introduce BabyVLM-V2, a developmentally grounded framework for infant-inspired vision-language modeling that extensively improves upon BabyVLM-V1 through a longitudinal, multifaceted pretra...","url_abs":"https://arxiv.org/abs/2512.10932","url_pdf":"https://arxiv.org/pdf/2512.10932v2","authors":"[\"Shengao Wang\",\"Wenqi Wang\",\"Zecheng Wang\",\"Max Whitton\",\"Michael Wakeham\",\"Arjun Chandra\",\"Joey Huang\",\"Pengyue Zhu\",\"Helen Chen\",\"David Li\",\"Jeffrey Li\",\"Shawn Li\",\"Andrew Zagula\",\"Amy Zhao\",\"Andrew Zhu\",\"Sayaka Nakamura\",\"Yuki Yamamoto\",\"Jerry Jun Yokono\",\"Aaron Mueller\",\"Bryan A. Plummer\",\"Kate Saenko\",\"Venkatesh Saligrama\",\"Boqing Gong\"]","published":"2025-12-11T18:57:05Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.AI\"]","methods":"[\"Language Model\"]","has_code":false}