{"ID":2889868,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.20146","arxiv_id":"2507.20146","title":"AlignFreeNet: Is Cross-Modal Pre-Alignment Necessary? An End-to-End Alignment-Free Lightweight Network for Visible-Infrared Object Detection","abstract":"Cross-modal misalignments, such as spatial offsets, resolution discrepancies, and semantic deficiencies, frequently occur in visible-infrared object detection (VI-OD). To mitigate this, existing methods are typically adapted into an alignment-based fusion paradigm, in which an explicit pixel- or feature-level alignment module is inserted before cross-modal fusion. However, pixel-level alignment struggles to cope with severe or mixed misalignments, whereas feature-level alignment often introduces undesirable noise into fused representations under such conditions, ultimately limiting detection performance. In this paper, we propose a novel alignment-free network (AlignFreeNet) for VI-OD. Differing from prior methods, AlignFreeNet abandons any explicit alignment and instead adopts an alignment-free fusion paradigm. Specifically, AlignFreeNet comprises two core modules: variation-guided cross-modal compensation (VCC) and frequency-guided cross-modal fusion (FCF). VCC adaptively feeds the compensated information derived from cross-modal discrepancies back into each modality, enhancing visible and infrared representations without the noise caused by explicit alignment. FCF achieves robust cross-modal fusion by suppressing task-irrelevant redundancy via frequency-domain gating, effectively mitigating noise introduced in the process. Moreover, VCC and FCF jointly exploit low- and high-frequency cues to preserve foreground contours in fused representations, effectively mitigating cross-modal blending caused by severe mixed misalignments. Extensive evaluations on DVTOD, M3FD, and DroneVehicle demonstrate that our AlignFreeNet achieves state-of-the-art performance under severe mixed misalignment conditions, highlighting its robustness and generalization.","short_abstract":"Cross-modal misalignments, such as spatial offsets, resolution discrepancies, and semantic deficiencies, frequently occur in visible-infrared object detection (VI-OD). To mitigate this, existing methods are typically adapted into an alignment-based fusion paradigm, in which an explicit pixel- or feature-level alignment...","url_abs":"https://arxiv.org/abs/2507.20146","url_pdf":"https://arxiv.org/pdf/2507.20146v2","authors":"[\"Dingkun Zhu\",\"Haote Zhang\",\"Lipeng Gu\",\"Wuzhou Quan\",\"Fu Lee Wang\",\"Honghui Fan\",\"Jiali Tang\",\"Haoran Xie\",\"Xiaoping Zhang\",\"Mingqiang Wei\"]","published":"2025-07-27T06:53:31Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[]","has_code":false}