{"ID":3004626,"CreatedAt":"2026-06-03T03:09:48.883664427Z","UpdatedAt":"2026-06-05T11:43:53.432517148Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.03986","arxiv_id":"2606.03986","title":"NewtPhys: Do Foundation Models Understand Newtonian Physics?","abstract":"Previous work has evaluated physics reasoning in foundation models using synthetic or semi-synthetic scenes and visual question-answering tasks. However, these benchmarks emphasize high-level events and lack the visual fidelity required to assess true low-level Newtonian understanding. We introduce NewtPhys, a 4D physically annotated dataset built from multiview images of real-world scenes with physics-grounded simulations. The dataset provides dense, fine-grained annotations across timesteps -- including 3D forces and amodal per-pixel quantities covering physics, tracking, semantics and geometry -- bridging the gap between simplistic synthetic setups and realistic visual complexity. Using NewtPhys, we systematically evaluate 56 VLMs, including 54 open-weight models and 2 closed-source frontier models, and 10 VFMs and reveal limitations in low-level physics reasoning. Beyond benchmarking, our dataset enables future research in physics-grounded vision and the development of next-generation physics-aware evaluations. Code and datasets are available at https://astra-vision.github.io/NewtPhys.","short_abstract":"Previous work has evaluated physics reasoning in foundation models using synthetic or semi-synthetic scenes and visual question-answering tasks. However, these benchmarks emphasize high-level events and lack the visual fidelity required to assess true low-level Newtonian understanding. We introduce NewtPhys, a 4D physi...","url_abs":"https://arxiv.org/abs/2606.03986","url_pdf":"https://arxiv.org/pdf/2606.03986v1","authors":"[\"Sebastian Cavada\",\"Soumava Paul\",\"Tuan-Hung Vu\",\"Andrei Bursuc\",\"Raoul de Charette\"]","published":"2026-06-02T17:59:12Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[]","has_code":false}