{"ID":2848384,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.25164","arxiv_id":"2510.25164","title":"Transformers in Medicine: Improving Vision-Language Alignment for Medical Image Captioning","abstract":"We present a transformer-based multimodal framework for generating clinically relevant captions for MRI scans. Our system combines a DEiT-Small vision transformer as an image encoder, MediCareBERT for caption embedding, and a custom LSTM-based decoder. The architecture is designed to semantically align image and textual embeddings, using hybrid cosine-MSE loss and contrastive inference via vector similarity. We benchmark our method on the MultiCaRe dataset, comparing performance on filtered brain-only MRIs versus general MRI images against state-of-the-art medical image captioning methods including BLIP, R2GenGPT, and recent transformer-based approaches. Results show that focusing on domain-specific data improves caption accuracy and semantic alignment. Our work proposes a scalable, interpretable solution for automated medical image reporting.","short_abstract":"We present a transformer-based multimodal framework for generating clinically relevant captions for MRI scans. Our system combines a DEiT-Small vision transformer as an image encoder, MediCareBERT for caption embedding, and a custom LSTM-based decoder. The architecture is designed to semantically align image and textua...","url_abs":"https://arxiv.org/abs/2510.25164","url_pdf":"https://arxiv.org/pdf/2510.25164v2","authors":"[\"Yogesh Thakku Suresh\",\"Vishwajeet Shivaji Hogale\",\"Luca-Alexandru Zamfira\",\"Anandavardhana Hegde\"]","published":"2025-10-29T04:49:20Z","proceeding":"eess.IV","tasks":"[\"eess.IV\",\"cs.AI\",\"cs.CV\"]","methods":"[\"Vision Transformer\",\"Transformer\"]","has_code":false}