IEEE T-RO | Flying Co-Stereo: Enabling Long-Range Aerial Dense Mapping via Collaborative Stereo Vision of Dynamic-Baseline

Recently, the team led by Prof. Wei Dong from Shanghai Jiao Tong University, in collaboration with Prof. Xingxing Zuo from Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI), published a paper titled “Flying Co-Stereo: Enabling Long-Range Aerial Dense Mapping via Collaborative Stereo Vision of Dynamic-Baseline” in IEEE Transactions on Robotics. This work presents a flying collaborative stereo vision system in which two UAVs form a wide-baseline configuration to enable long-range dense 3D mapping. The proposed system achieves dense reconstruction at distances of up to 70 meters, with a relative error ranging from 2.3% to 9.7%.

The NOKOV motion capture system provided high-precision ground-truth pose data for validating the proposed relative pose estimation algorithm.

Background

For UAVs operating in large-scale unknown environments, long-range perception is essential for safe navigation. Compared with LiDAR systems, stereo cameras offer advantages in terms of cost-effectiveness and lightweight design. However, conventional stereo cameras are constrained by short fixed baselines, which typically limit their perception range to within 20 meters. Existing wide-baseline systems are often too large to be deployed on small UAV platforms. Meanwhile, distributing stereo cameras across two dynamically flying UAVs introduces additional challenges, including dynamically varying baselines and difficulties in cross-view feature association.
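The range limit of a short fixed baseline follows from the standard first-order depth-error relation for a rectified stereo pair, in which depth uncertainty grows roughly as Z^2 / (f * b). The sketch below evaluates that relation. The focal length, disparity error, and error tolerance are illustrative assumptions, not values taken from the paper.

```python
def stereo_depth_error(depth_m, baseline_m, focal_px, disparity_err_px):
    """First-order depth uncertainty of a rectified stereo pair.

    dZ ~= Z^2 / (f * b) * dd, with Z the depth, f the focal length in pixels,
    b the baseline in meters, and dd the disparity matching error in pixels.
    """
    return depth_m ** 2 / (focal_px * baseline_m) * disparity_err_px


def max_range(baseline_m, focal_px, rel_err, disparity_err_px):
    """Largest depth at which the relative depth error dZ / Z stays below rel_err."""
    return rel_err * focal_px * baseline_m / disparity_err_px


# Assumed, illustrative camera parameters (not from the paper).
FOCAL_PX = 400.0
DISPARITY_ERR_PX = 0.25
REL_ERR = 0.05

if __name__ == "__main__":
    for baseline in (0.1, 1.0, 3.0):
        limit = max_range(baseline, FOCAL_PX, REL_ERR, DISPARITY_ERR_PX)
        print(f"baseline {baseline:.1f} m -> usable range about {limit:.0f} m")
```

Under this idealized model the usable range scales linearly with the baseline, which is the motivation for spreading the two cameras across two UAVs. The model ignores errors in the estimated baseline itself, which the dynamic configuration has to handle separately.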

System architecture of Flying Co-Stereo within our proposed CDBSM framework

Contributions

1) A Flying Co-Stereo system is proposed, in which two collaborative UAVs form a wide-baseline, cross-agent stereo vision setup within a unified CDBSM framework, enabling long-range dense mapping in large-scale unknown environments.

2) A Dual-Spectrum Visual-Inertial-Ranging Estimator (DS-VIRE) is developed to achieve robust and accurate online estimation of the dynamic inter-UAV baseline in complex outdoor conditions.

3) A hybrid visual feature association strategy is designed, combining cross-agent deep matching with intra-agent feature tracking, to ensure real-time and persistent co-visible feature correspondences under varying viewpoints.

4) A sparse-to-dense depth recovery scheme is proposed, which refines dense monocular depth predictions using exponential fitting of long-range triangulated sparse landmarks for precise metric-scale mapping.
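The sparse-to-dense idea in contribution 4 can be illustrated with a small fitting sketch. This article does not give the paper's exact parameterization, so the model form Z = a * exp(b * d), the log-linear least-squares fit, and the function names below are all assumptions made for illustration.

```python
import math


def fit_exponential(rel_depths, metric_depths):
    """Fit Z = a * exp(b * d) by least squares on ln Z = ln a + b * d.

    rel_depths: monocular depth predictions sampled at the sparse landmarks.
    metric_depths: triangulated metric depths of the same landmarks (> 0).
    """
    n = len(rel_depths)
    log_z = [math.log(z) for z in metric_depths]
    mean_d = sum(rel_depths) / n
    mean_y = sum(log_z) / n
    sxx = sum((d - mean_d) ** 2 for d in rel_depths)
    sxy = sum((d - mean_d) * (y - mean_y) for d, y in zip(rel_depths, log_z))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_d)
    return a, b


def apply_exponential(a, b, rel_depth_map):
    """Convert a dense relative depth map (list of rows) to metric depth."""
    return [[a * math.exp(b * d) for d in row] for row in rel_depth_map]


if __name__ == "__main__":
    # Synthetic landmarks generated from a known model, for a sanity check only.
    rel = [0.1, 0.3, 0.5, 0.8, 1.0]
    metric = [5.0 * math.exp(2.5 * d) for d in rel]
    a, b = fit_exponential(rel, metric)
    print(a, b)
    print(apply_exponential(a, b, [[0.0, 1.0]]))
```

A log-linear fit is used here only because it has a closed form. A real pipeline would probably also need outlier rejection for mis-triangulated landmarks.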

Experimental Validation

1. Dynamic Baseline Estimation

Experiments are conducted to evaluate the accuracy of relative pose estimation between the two UAVs in the Flying Co-Stereo system. The two UAVs autonomously fly synchronized circular trajectories in the East-North-Up (ENU) coordinate frame, with a baseline length of 3 m. The relative pose estimates from the proposed Dual-Spectrum Visual-Inertial-Ranging Estimator (DS-VIRE) are compared against two baseline methods: (1) a visual PnP-based method relying solely on inter-UAV observations, and (2) a VIO differencing method that derives the relative pose by subtracting the individual VIO poses of the two UAVs.

The NOKOV motion capture system is employed to provide ground-truth relative poses as the evaluation benchmark.
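The VIO differencing baseline mentioned above can be sketched as follows. This is a simplified 4-DoF version (position plus yaw) that assumes both VIO estimates are already expressed in a common ENU frame and ignores roll and pitch. The function names and the tuple layout are hypothetical.

```python
import math


def wrap_angle(angle_rad):
    """Wrap an angle in radians to the interval [-pi, pi)."""
    return (angle_rad + math.pi) % (2.0 * math.pi) - math.pi


def relative_pose_4dof(pose_a, pose_b):
    """Pose of UAV B expressed in the yaw-aligned frame of UAV A.

    Each pose is (x, y, z, yaw) in a shared ENU frame, yaw in radians.
    """
    xa, ya, za, yaw_a = pose_a
    xb, yb, zb, yaw_b = pose_b
    dx, dy, dz = xb - xa, yb - ya, zb - za
    c, s = math.cos(yaw_a), math.sin(yaw_a)
    # Rotate the world-frame offset by the inverse of A's yaw rotation.
    x_rel = c * dx + s * dy
    y_rel = -s * dx + c * dy
    return (x_rel, y_rel, dz, wrap_angle(yaw_b - yaw_a))


if __name__ == "__main__":
    # A faces north; B hovers 3 m to the north at the same height and heading.
    uav_a = (0.0, 0.0, 10.0, math.pi / 2)
    uav_b = (0.0, 3.0, 10.0, math.pi / 2)
    print(relative_pose_4dof(uav_a, uav_b))
```

Because each VIO estimate drifts independently, the drift of both UAVs enters the difference. This is a plausible reason for such a baseline to be less accurate than an estimator that also uses direct inter-UAV observations.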

Experiments for relative pose estimation of Flying Co-Stereo under NOKOV motion capture system

Experimental results show that the DS-VIRE achieves a total mean absolute error (MAE) of 0.013 m for relative position estimation, significantly outperforming the visual PnP-based method (0.018 m) and the VIO differencing method (0.024 m). For relative orientation estimation, the MAE of yaw is 0.214°.
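For readers who want to reproduce this kind of evaluation, the sketch below shows one way to compute position and yaw MAE against motion capture ground truth. The article does not state how the "total" position MAE is defined, so using the Euclidean norm of the position error per sample is an assumption, as are the function names.

```python
import math


def position_mae(estimates, ground_truth):
    """Mean Euclidean distance between estimated and true relative positions."""
    errors = [math.dist(e, g) for e, g in zip(estimates, ground_truth)]
    return sum(errors) / len(errors)


def yaw_mae_deg(est_deg, gt_deg):
    """Mean absolute yaw error in degrees, with wrap-around at +/-180."""
    diffs = [abs((e - g + 180.0) % 360.0 - 180.0) for e, g in zip(est_deg, gt_deg)]
    return sum(diffs) / len(diffs)


if __name__ == "__main__":
    # Made-up samples, only to exercise the functions.
    est_pos = [(3.0, 0.0, 0.0), (0.0, 3.01, 0.0)]
    gt_pos = [(3.0, 0.0, 0.0), (0.0, 3.0, 0.0)]
    print(position_mae(est_pos, gt_pos))
    print(yaw_mae_deg([179.0, -179.5], [-179.0, 179.5]))
```

The wrap-around step matters for circular trajectories, where yaw regularly crosses the +/-180 degree boundary and a naive difference would report errors near 360 degrees.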

In addition, the robustness of dynamic baseline estimation is evaluated through real-world outdoor experiments under challenging conditions, including intense sunlight, complex background clutter, severe infrared noise, and long observation distances. Results demonstrate that the proposed Dual-Spectrum Marker-Based Visual Detection and Tracking (DS-MVDT) algorithm achieves a tracking success rate of over 96% across all scenarios, significantly surpassing the baseline method (YOLOv4-tiny + MOSSE), whose success rate ranges between 17% and 70%.

DS-MVDT experiments under intense sunlight, cluttered backgrounds, light noise, and long-range observation

2. Cross-Camera Feature Association Performance Evaluation

Experiments are conducted to compare the real-time performance of the proposed Guidance-Prediction SuperPoint-SuperGlue (GP-SS) algorithm against three baseline methods: the original SuperPoint-SuperGlue (SS), ORB, and SURF. Results show that GP-SS achieves a feature association frequency of nearly 30 Hz, substantially outperforming the SS baseline (13 Hz).

3. Collaborative Triangulation Accuracy Evaluation of Sparse Landmarks

Experiments are performed to evaluate the number and accuracy of reconstructed landmarks across different depth segments (0–10 m, 10–30 m, 30–50 m, 50–70 m). The results demonstrate that the proposed system maintains effective triangulation capability beyond 30 m, whereas the single-UAV approach fails to triangulate landmarks at such distances.
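To make the triangulation step concrete, the sketch below implements generic two-view midpoint triangulation: each UAV contributes a camera center and a viewing ray toward the same landmark, and the landmark is taken as the midpoint of the shortest segment between the two rays. This is a textbook method chosen for illustration, not necessarily the one used in the paper, and the function names are hypothetical.

```python
def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def triangulate_midpoint(center_1, dir_1, center_2, dir_2, eps=1e-12):
    """Midpoint of the shortest segment between two 3D rays.

    center_i: camera center of UAV i in a common frame.
    dir_i: viewing direction toward the landmark (need not be unit length).
    """
    w = tuple(a - b for a, b in zip(center_1, center_2))
    a = _dot(dir_1, dir_1)
    b = _dot(dir_1, dir_2)
    c = _dot(dir_2, dir_2)
    d = _dot(dir_1, w)
    e = _dot(dir_2, w)
    denom = a * c - b * b
    if abs(denom) < eps:
        # Parallel rays carry no depth information.
        raise ValueError("rays are (nearly) parallel; landmark depth is unobservable")
    s = (b * e - c * d) / denom
    t = (a * e - b * d) / denom
    p1 = tuple(ci + s * di for ci, di in zip(center_1, dir_1))
    p2 = tuple(ci + t * di for ci, di in zip(center_2, dir_2))
    return tuple((u + v) / 2.0 for u, v in zip(p1, p2))


if __name__ == "__main__":
    # Two cameras 3 m apart observing a landmark 50 m ahead (noise-free rays).
    landmark = (10.0, 2.0, 50.0)
    cam_1, cam_2 = (0.0, 0.0, 0.0), (3.0, 0.0, 0.0)
    ray_1 = tuple(p - c for p, c in zip(landmark, cam_1))
    ray_2 = tuple(p - c for p, c in zip(landmark, cam_2))
    print(triangulate_midpoint(cam_1, ray_1, cam_2, ray_2))
```

As the two rays approach parallel, which happens for distant landmarks seen over a short baseline, `denom` approaches zero and the result becomes highly sensitive to noise. This is consistent with the single-UAV setup losing triangulation capability beyond 30 m.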

4. Long-Range Dense Mapping Performance Evaluation

Dense mapping experiments are conducted in multiple real-world and simulated environments. The proposed exponential fitting model is compared against quadratic and linear fitting models, as well as two advanced Multi-View Stereo (MVS) methods, SimpleRecon and MVSAnywhere. Experimental results show that the proposed system achieves dense mapping at distances of up to 70 m with a relative error ranging from 2.3% to 9.7%. Compared to conventional stereo cameras, the system achieves up to a 350% improvement in maximum perception range and up to a 450% increase in coverage area.

Long-range dense reconstruction experiments in outdoor environments and photorealistic simulation

Corresponding Authors

Wei Dong

Tenured associate professor at the School of Mechanical Engineering, Shanghai Jiao Tong University. His research focuses on multi-robot collaboration and active perception.

Xingxing Zuo

Tenured assistant professor in the Department of Robotics at Mohamed Bin Zayed University of Artificial Intelligence. His research interests include robotics, spatial intelligence, state estimation, and embodied intelligence.

At the upcoming ICRA 2026, Prof. Xingxing Zuo, together with international scholars, will organize the workshop titled “MM-SpatialAI: Multi-Modal Spatial AI for Robust Navigation and Open-World Understanding.”

NOKOV Motion Capture is a sponsor of this workshop. Researchers in related fields are welcome to participate and contribute to advancing multimodal spatial intelligence for robust navigation and open-world understanding.

The workshop homepage is https://xingxingzuo.github.io/MM-SpatialAI/.

Capturing Motion,
Crafting Stories
