{"ID":2851238,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.20639","arxiv_id":"2510.20639","title":"Better Tokens for Better 3D: Advancing Vision-Language Modeling in 3D Medical Imaging","abstract":"Recent progress in vision-language modeling for 3D medical imaging has been fueled by large-scale computed tomography (CT) corpora with paired free-text reports, stronger architectures, and powerful pretrained models. This has enabled applications such as automated report generation and text-conditioned 3D image synthesis. Yet, current approaches struggle with high-resolution, long-sequence volumes: contrastive pretraining often yields vision encoders that are misaligned with clinical language, and slice-wise tokenization blurs fine anatomy, reducing diagnostic performance on downstream tasks. We introduce BTB3D (Better Tokens for Better 3D), a causal convolutional encoder-decoder that unifies 2D and 3D training and inference while producing compact, frequency-aware volumetric tokens. A three-stage training curriculum enables (i) local reconstruction, (ii) overlapping-window tiling, and (iii) long-context decoder refinement, during which the model learns from short slice excerpts yet generalizes to scans exceeding 300 slices without additional memory overhead. BTB3D sets a new state-of-the-art on two key tasks: it improves BLEU scores and increases clinical F1 by 40% over CT2Rep, CT-CHAT, and Merlin for report generation; and it reduces FID by 75% and halves FVD compared to GenerateCT and MedSyn for text-to-CT synthesis, producing anatomically consistent 512*512*241 volumes. These results confirm that precise three-dimensional tokenization, rather than larger language backbones alone, is essential for scalable vision-language modeling in 3D medical imaging. The codebase is available at: https://github.com/ibrahimethemhamamci/BTB3D","short_abstract":"Recent progress in vision-language modeling for 3D medical imaging has been fueled by large-scale computed tomography (CT) corpora with paired free-text reports, stronger architectures, and powerful pretrained models. This has enabled applications such as automated report generation and text-conditioned 3D image synthe...","url_abs":"https://arxiv.org/abs/2510.20639","url_pdf":"https://arxiv.org/pdf/2510.20639v1","authors":"[\"Ibrahim Ethem Hamamci\",\"Sezgin Er\",\"Suprosanna Shit\",\"Hadrien Reynaud\",\"Dong Yang\",\"Pengfei Guo\",\"Marc Edgar\",\"Daguang Xu\",\"Bernhard Kainz\",\"Bjoern Menze\"]","published":"2025-10-23T15:13:13Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Language Model\"]","has_code":false,"code_links":[{"ID":607881,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2851238,"paper_url":"https://arxiv.org/abs/2510.20639","paper_title":"Better Tokens for Better 3D: Advancing Vision-Language Modeling in 3D Medical Imaging","repo_url":"https://github.com/ibrahimethemhamamci/BTB3D","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
