Predictive Beamforming in Low-Altitude Wireless Networks: A Cross-Attention Approach
Abstract
Accurate beam prediction is essential for maintaining reliable links and high spectral efficiency in dynamic low-altitude wireless networks. However, existing approaches often fail to capture the deep correlations across heterogeneous sensing modalities, which limits their adaptability in complex three-dimensional environments. To address this limitation, we propose a multi-modal predictive beamforming method based on a cross-attention fusion mechanism that jointly leverages visual and structured sensor data. The proposed model uses a Convolutional Neural Network (CNN) to learn multi-scale spatial feature hierarchies from images and a Transformer encoder to capture cross-dimensional dependencies within the sensor data. A cross-attention fusion module then integrates complementary information from the two modalities, producing a unified and discriminative representation for beam prediction. In experiments on a real-world dataset, our method achieves 79.7% Top-1 accuracy and 99.3% Top-3 accuracy, surpassing the 3D ResNet-Transformer baseline by 4.4% to 23.2% across the Top-1 to Top-5 metrics. These results demonstrate that multi-modal cross-attention fusion is effective for intelligent beam selection in dynamic low-altitude wireless networks.
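To make the fusion step concrete, the following is a minimal, dependency-free sketch of scaled dot-product cross-attention between two token sets. It is an illustration under stated assumptions, not the model evaluated in this paper: the direction of attention (sensor tokens as queries, visual tokens as keys and values), the omission of learned projection matrices, the residual addition, and the mean pooling are all choices made here for brevity, and the names `cross_attention` and `fuse` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention where each query attends over all keys.

    Assumption: no learned Q/K/V projections, so tokens are used as-is.
    Returns the attended vectors and the attention weights per query.
    """
    dim = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys
        ]
        w = softmax(scores)
        attended = [
            sum(wj * v[c] for wj, v in zip(w, values))
            for c in range(len(values[0]))
        ]
        outputs.append(attended)
        weights.append(w)
    return outputs, weights


def fuse(sensor_tokens, visual_tokens):
    """Hypothetical fusion: sensor tokens query the visual tokens,
    a residual connection keeps the sensor content, and mean pooling
    yields one fused vector for a downstream beam classifier."""
    attended, weights = cross_attention(sensor_tokens, visual_tokens, visual_tokens)
    fused = [
        [s + a for s, a in zip(tok, att)]
        for tok, att in zip(sensor_tokens, attended)
    ]
    dim = len(fused[0])
    pooled = [sum(tok[c] for tok in fused) / len(fused) for c in range(dim)]
    return pooled, weights


# Toy inputs standing in for CNN image features and Transformer sensor features.
visual_tokens = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
sensor_tokens = [
    [2.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 0.0],
]

fused_vector, attn = fuse(sensor_tokens, visual_tokens)
```

In this toy setting each sensor token places most of its attention on the visual token it is aligned with, which is the behavior a fusion module relies on to pull complementary image evidence into the sensor representation. A practical model would add learned projections, multiple heads, and normalization.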