{"ID":2842015,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.09893","arxiv_id":"2511.09893","title":"Regional Attention-Enhanced Swin Transformer for Clinically Relevant Medical Image Captioning","abstract":"Automated medical image captioning translates complex radiological images into diagnostic narratives that can support reporting workflows. We present a Swin-BART encoder-decoder system with a lightweight regional attention module that amplifies diagnostically salient regions before cross-attention. Trained and evaluated on ROCO, our model achieves state-of-the-art semantic fidelity while remaining compact and interpretable. We report results as mean$\\pm$std over three seeds and include $95\\%$ confidence intervals. Compared with baselines, our approach improves ROUGE (proposed 0.603, ResNet-CNN 0.356, BLIP2-OPT 0.255) and BERTScore (proposed 0.807, BLIP2-OPT 0.645, ResNet-CNN 0.623), with competitive BLEU, CIDEr, and METEOR. We further provide ablations (regional attention on/off and token-count sweep), per-modality analysis (CT/MRI/X-ray), paired significance tests, and qualitative heatmaps that visualize the regions driving each description. Decoding uses beam search (beam size $=4$), length penalty $=1.1$, $no\\_repeat\\_ngram\\_size$ $=3$, and max length $=128$. The proposed design yields accurate, clinically phrased captions and transparent regional attributions, supporting safe research use with a human in the loop.","short_abstract":"Automated medical image captioning translates complex radiological images into diagnostic narratives that can support reporting workflows. We present a Swin-BART encoder-decoder system with a lightweight regional attention module that amplifies diagnostically salient regions before cross-attention. Trained and evaluate...","url_abs":"https://arxiv.org/abs/2511.09893","url_pdf":"https://arxiv.org/pdf/2511.09893v1","authors":"[\"Zubia Naz\",\"Farhan Asghar\",\"Muhammad Ishfaq Hussain\",\"Yahya Hadadi\",\"Muhammad Aasim Rafique\",\"Wookjin Choi\",\"Moongu Jeon\"]","published":"2025-11-13T02:55:32Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.CL\"]","methods":"[\"Transformer\",\"Convolutional Neural Network\"]","has_code":false}