{"ID":3084802,"CreatedAt":"2026-06-05T06:46:15.197025399Z","UpdatedAt":"2026-06-07T02:49:24.740369373Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.05635","arxiv_id":"2606.05635","title":"ShotCrop$^3$: Cropping Human-Centric Images into Cinematic Triple-Shot Compositions","abstract":"Prior work on aesthetic composition typically produces a single aesthetically pleasing crop, overlooking the narrative value of composing multiple shots from one scene. In practice, multi-shot composition is critical for downstream creative workflows: commercial posters often require multiple crops with different emphases (e.g., context, subject, and emotion/product details) to present key story beats. Therefore, we propose \\textbf{Triple-Shot Compositions (TSC)}, a composition task that generates a three-shot set -- establishing, medium, and close-up -- from a single human-centric image, each paired with a brief shot description to support visual narration. To learn TSC with limited expert annotations, we introduce \\textbf{ShotCrop} which undergoes a three-stage training process: it first applies Chain-of-Thought supervised fine-tuning to establish basic reasoning and aesthetic shot-cropping skills, then performs semi-supervised fine-tuning with high-confidence pseudo labels to further enhance aesthetic capability, and is finally optimized with Group Relative Policy Optimization for \\textbf{ShotCrop} (GRPO-S) using a composite reward tailored for it. Specifically, our pseudo-labeling strategy combines MLLM-based scoring, aesthetic assessment, and CLIP similarity to retain high-confidence training signals. In addition, we present TSC-Bench, a benchmark of 1.2k expert-annotated test cases. Notably, ShotCrop achieves an average improvement of \\textbf{2.82} times over GPT-5 in shot localization accuracy.","short_abstract":"Prior work on aesthetic composition typically produces a single aesthetically pleasing crop, overlooking the narrative value of composing multiple shots from one scene. In practice, multi-shot composition is critical for downstream creative workflows: commercial posters often require multiple crops with different empha...","url_abs":"https://arxiv.org/abs/2606.05635","url_pdf":"https://arxiv.org/pdf/2606.05635v1","authors":"[\"Dehong Kong\",\"Lina Lei\",\"Lingtao Zheng\",\"Chenyang Wu\",\"Ailing Zhang\",\"Xinran Qin\",\"Teng Ma\",\"Jiaqi Xu\",\"Zhixin Wang\",\"Zhikai Chen\",\"Xuecheng Qi\",\"Renjing Pei\",\"Fan Li\"]","published":"2026-06-04T03:01:12Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.MM\"]","methods":"[\"Large Language Model\"]","has_code":false}
