MoCrop: Training Free Motion Guided Cropping for Efficient Video Action Recognition

Sep 22, 2025 cs.CV arXiv:2509.18473

Abstract

Standard video action recognition models typically process resized full frames, which incurs spatial redundancy and high computational cost. To address this, we introduce MoCrop, a motion-aware adaptive cropping module for efficient video action recognition in the compressed domain. Leveraging the Motion Vectors (MVs) already available in H.264 video, MoCrop localizes motion-dense regions and produces adaptive crops at inference time, without any training or parameter updates. Our lightweight pipeline combines three components: Merge & Denoise (MD) for outlier filtering, Monte Carlo Sampling (MCS) for efficient importance sampling, and Motion Grid Search (MGS) for optimal region localization. This design lets MoCrop serve as a versatile "plug-and-play" module for diverse backbones. Extensive experiments on UCF101 show that MoCrop acts as both an accelerator and an enhancer. With ResNet-50, it achieves a +3.5% boost in Top-1 accuracy at equivalent FLOPs (Attention Setting), or a +2.4% accuracy gain with 26.5% fewer FLOPs (Efficiency Setting). When applied to CoViAR, it improves accuracy to 89.2% or reduces computation by roughly 27% (from 11.6 to 8.5 GFLOPs). Consistent gains across MobileNet-V3, EfficientNet-B1, and Swin-B confirm its generality and suitability for real-time deployment. Our code and models are available at https://github.com/microa/MoCrop.
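To make the three-stage idea concrete, the following is a minimal illustrative sketch, not the authors' implementation. The abstract does not specify how MD, MCS, or MGS are computed, so every detail here is an assumption: the function name `motion_crop`, its parameters, the percentile-based clipping used as a stand-in for denoising, the magnitude-proportional sampling, and the sliding-window search over an integral image are all hypothetical choices. The input is assumed to be an already-decoded `(H, W, 2)` array of motion vectors.

```python
import numpy as np

def motion_crop(mv, crop_h, crop_w, n_samples=512, stride=4,
                noise_frac=0.1, seed=0):
    """Pick the (top, left) corner of a crop_h x crop_w window covering
    the densest motion. Hypothetical sketch; all parameters are assumed.

    mv: (H, W, 2) array of motion vectors (dx, dy), one per grid cell.
    """
    rng = np.random.default_rng(seed)
    mag = np.linalg.norm(np.asarray(mv, dtype=np.float64), axis=-1)
    H, W = mag.shape
    center = ((H - crop_h) // 2, (W - crop_w) // 2)

    # Denoising stand-in (assumption): clip extreme magnitudes at the
    # 99th percentile of nonzero values, then zero out weak responses.
    nonzero = mag[mag > 0]
    if nonzero.size == 0:
        return center  # no motion at all: fall back to a center crop
    hi = np.percentile(nonzero, 99)
    mag = np.minimum(mag, hi)
    mag[mag < noise_frac * hi] = 0.0

    # Monte Carlo sampling: draw cells with probability proportional
    # to their (denoised) motion magnitude.
    p = mag.ravel() / mag.sum()
    idx = rng.choice(H * W, size=n_samples, p=p)
    ys, xs = np.divmod(idx, W)
    counts = np.zeros((H, W))
    np.add.at(counts, (ys, xs), 1)

    # Grid search: an integral image gives each window's sample count
    # in O(1); keep the window that contains the most samples.
    ii = np.pad(counts.cumsum(axis=0).cumsum(axis=1), ((1, 0), (1, 0)))
    best, best_pos = -1.0, center
    for top in range(0, H - crop_h + 1, stride):
        for left in range(0, W - crop_w + 1, stride):
            s = (ii[top + crop_h, left + crop_w] - ii[top, left + crop_w]
                 - ii[top + crop_h, left] + ii[top, left])
            if s > best:
                best, best_pos = s, (top, left)
    return best_pos
```

Under these assumptions the returned corner would be used to crop the decoded frame before it is passed to an unmodified backbone, which is what makes a module of this kind training-free. The window size, stride, and sample count trade localization quality against search cost.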
