{"ID":2829616,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.13727","arxiv_id":"2512.13727","title":"RAST-MoE-RL: A Regime-Aware Spatio-Temporal MoE Framework for Deep Reinforcement Learning in Ride-Hailing","abstract":"Ride-hailing platforms face the challenge of balancing passenger waiting times with overall system efficiency under highly uncertain supply-demand conditions. Adaptive delayed matching, which controls the holding intervals for batched sets of requests and vehicles, reveals an inherent trade-off between matching and pickup delays. The resulting environment with temporally varying request arrival patterns and dynamic congestion calls for more expressive networks with sufficient capacity to capture their non-stationarity. To address the limitations of existing methods that rely on shallow encoders that cannot capture dynamic supply-demand patterns and congestion effects, we introduce the Regime-Aware Spatio-Temporal Mixture-of-Experts (RAST-MoE) framework, which formalizes adaptive delayed matching as a regime-aware Markov Decision Process and equips RL agents with a self-attention MoE encoder. Instead of relying on a single monolithic network, our design allows different experts to specialize automatically in varying operational conditions, improving representation capacity while maintaining per-sample computation efficiency. Despite its modest size of only 12M parameters, our framework consistently outperforms strong baselines. On real-world Uber trajectory data from San Francisco, it reduces average matching delay by 10%, and pickup delay by 15%. In addition, it demonstrates robustness to unseen demand regimes, stable training behavior without reward hacking, and expert specialization to different regimes. This study shows the strength of MoE-enhanced RL for large-scale decision-making tasks with complex spatiotemporal dynamics.","short_abstract":"Ride-hailing platforms face the challenge of balancing passenger waiting times with overall system efficiency under highly uncertain supply-demand conditions. Adaptive delayed matching, which controls the holding intervals for batched sets of requests and vehicles, reveals an inherent trade-off between matching and pic...","url_abs":"https://arxiv.org/abs/2512.13727","url_pdf":"https://arxiv.org/pdf/2512.13727v2","authors":"[\"Yuhan Tang\",\"Kangxin Cui\",\"Jung Ho Park\",\"Yibo Zhao\",\"Xuan Jiang\",\"Haoze He\",\"Jiangbo Yu\",\"Haris Koutsopoulos\",\"Jinhua Zhao\"]","published":"2025-12-13T20:49:15Z","proceeding":"cs.LG","tasks":"[\"cs.LG\"]","methods":"[\"Reinforcement Learning\"]","has_code":false}
