An Efficient and Multi-Modal Navigation System with One-Step World Model

1Department of Precision Instrument, Tsinghua University,

2Xiaomi Robotics Lab, Beijing, China

Task Demo

Our world-model-based framework enables multi-modal goal-conditioned navigation, including image, language, and point goals.

Abstract

Navigation is a fundamental capability for mobile robots. While the current trend is to use learning-based approaches to replace traditional geometry-based methods, existing end-to-end learning-based policies often struggle with 3D spatial reasoning and lack a comprehensive understanding of physical world dynamics. Integrating world models—which predict future observations conditioned on given actions—with iterative optimization planning offers a promising solution due to their capacity for imagination and flexibility. However, current navigation world models, typically built on pure transformer architectures, often rely on multi-step diffusion processes and autoregressive frame-by-frame generation. These mechanisms result in prohibitive computational latency, rendering real-time deployment impossible. To address this bottleneck, we propose a lightweight navigation world model that adopts a one-step generation paradigm and a 3D U-Net backbone equipped with efficient spatial-temporal attention. This design drastically reduces inference latency, enabling high-frequency control while achieving superior predictive performance. We also integrate this model into an optimization-based planning framework utilizing anchor-based initialization to handle multi-modal goal navigation tasks. Extensive closed-loop experiments in both simulation and real-world environments demonstrate our system's superior efficiency and robustness compared to state-of-the-art baselines.

System Architecture


Our system introduces a shortcut-based one-step generation paradigm for navigation world models. Unlike traditional diffusion models that require expensive iterative denoising, our world model directly predicts a sequence of 11 future frames in a single step. We utilize a 3D U-Net backbone operating within a VAE latent space. To handle high-dimensional video data efficiently, we employ a hybrid CNN-Transformer architecture with decoupled spatial and temporal attention. Specifically, a window-based temporal attention mechanism allows the model to capture complex dynamics without the quadratic complexity of full global attention.

We integrate this lightweight world model into a model-based planning framework using the Cross-Entropy Method (CEM). To ensure robust performance under limited computational budgets (small sample sizes), we propose an anchor-based initialization strategy: instead of random sampling, candidate trajectories are initialized from fixed velocity priors, significantly improving the planner's efficiency and success rate across multi-modal tasks (Image, Language, and Point goals). A simplified sketch of the planner is shown below.
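The following is a minimal sketch of CEM planning with anchor-based initialization on top of a one-step world model. The interface `world_model(obs, actions)`, the scoring function `score_fn`, and all anchor speeds and hyperparameters are illustrative assumptions, not the exact implementation used in the paper.

```python
import numpy as np

def anchor_initialization(num_samples, horizon, anchor_speeds=(0.0, 0.25, 0.5),
                          anchor_yaw_rates=(-0.5, 0.0, 0.5)):
    """Initialize candidate action sequences from fixed velocity priors
    (constant linear-speed / yaw-rate pairs) instead of pure random noise."""
    anchors = [np.tile([v, w], (horizon, 1))
               for v in anchor_speeds for w in anchor_yaw_rates]
    anchors = np.stack(anchors)                              # (A, T, 2)
    reps = int(np.ceil(num_samples / len(anchors)))
    base = np.tile(anchors, (reps, 1, 1))[:num_samples]      # (N, T, 2)
    return base + np.random.normal(0.0, 0.05, size=base.shape)

def cem_plan(obs, goal, world_model, score_fn,
             num_samples=32, horizon=11, iters=3, elite_frac=0.25):
    """Cross-Entropy Method planning. `world_model(obs, actions)` is assumed to
    return predicted future frames for a batch of action sequences in a single
    forward pass (the one-step generation paradigm)."""
    actions = anchor_initialization(num_samples, horizon)
    n_elite = max(1, int(elite_frac * num_samples))
    for _ in range(iters):
        preds = world_model(obs, actions)                    # (N, T, H, W, 3)
        scores = np.array([score_fn(p, goal) for p in preds])
        elite = actions[np.argsort(scores)[-n_elite:]]       # keep best sequences
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-3
        actions = np.random.normal(mu, sigma, size=(num_samples, horizon, 2))
    return mu  # mean elite action sequence; execute its first action
```

The same planner is reused across goal modalities; only `score_fn` changes (LPIPS for image goals, SigLIP for language goals, and the composite loss for point goals described below).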

Simulation Experiments

We evaluate our method in Habitat-sim across different modalities: Image-Goal, Language-Goal, and Point-Goal navigation. The sequence below the video displays the world model's predictions corresponding to the current optimal planned trajectory.

Image-Goal Navigation

To implement image-goal navigation, we employ LPIPS as the loss function to score sampled trajectories. Drawing inspiration from prior vision-based approaches, we utilize a topological memory system to achieve long-horizon navigation. Specifically, the graph is initialized with the first observation. At each time step, a distance estimation network identifies the nearest node in the topological graph to estimate the current location; the image of the adjacent node is then fed to the policy as the immediate subgoal.
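A minimal sketch of the LPIPS-based trajectory score, assuming the `lpips` package and predicted frames as PyTorch tensors in [-1, 1]; the topological subgoal selection described above is abstracted away, and all names are illustrative.

```python
import torch
import lpips

# Perceptual similarity network (AlexNet backbone); lower LPIPS = more similar.
lpips_fn = lpips.LPIPS(net='alex')

def image_goal_score(pred_frames, goal_image):
    """Score a candidate trajectory by how perceptually close its final
    predicted frame is to the (sub)goal image.
    pred_frames: (T, 3, H, W), goal_image: (3, H, W), both in [-1, 1]."""
    final = pred_frames[-1].unsqueeze(0)
    goal = goal_image.unsqueeze(0)
    with torch.no_grad():
        dist = lpips_fn(final, goal).item()
    return -dist  # higher score = better trajectory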


Language-Goal Navigation

To implement language-goal navigation, we employ SigLIP as the scoring function to evaluate sampled trajectories.
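A minimal sketch of the SigLIP-based score, assuming a Hugging Face SigLIP checkpoint (the model name is illustrative) and the last predicted frame converted to a PIL image.

```python
import torch
from transformers import AutoProcessor, AutoModel

# Any SigLIP variant with the same API works; this checkpoint is an example.
processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")
model = AutoModel.from_pretrained("google/siglip-base-patch16-224")

def language_goal_score(final_frame_pil, instruction):
    """Score a trajectory by the SigLIP image-text similarity between its
    final predicted frame and the language goal."""
    inputs = processor(text=[instruction], images=final_frame_pil,
                       return_tensors="pt", padding="max_length")
    with torch.no_grad():
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt = model.get_text_features(input_ids=inputs["input_ids"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return (img @ txt.T).item()  # cosine similarity; higher = better
```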


Point-Goal Navigation

To implement Point-Goal Navigation, we incorporate Depth-Anything to perform monocular depth estimation on the predicted future images. We formulate a composite planning loss function to evaluate candidate trajectories. Specifically, this loss is defined as the weighted sum of two terms: (1) the Euclidean distance between the terminal point of the sampled trajectory and the target coordinate, and (2) the negative mean depth value in the central region of the predicted image (to penalize proximity to obstacles).
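A minimal sketch of this composite loss, assuming the sampled trajectory's waypoints and per-frame Depth-Anything predictions are already available; the weights and the size of the central region are illustrative.

```python
import numpy as np

def point_goal_loss(traj_xy, pred_depths, goal_xy,
                    w_dist=1.0, w_depth=0.5, center_frac=0.3):
    """Composite planning loss for point-goal navigation:
    (1) Euclidean distance from the trajectory's terminal point to the goal, plus
    (2) the negative mean depth of the central region of the last predicted
        depth map (small central depth implies a nearby obstacle)."""
    dist_term = np.linalg.norm(traj_xy[-1] - goal_xy)

    depth = pred_depths[-1]                      # (H, W) monocular depth estimate
    h, w = depth.shape
    dh, dw = int(h * center_frac), int(w * center_frac)
    center = depth[(h - dh) // 2:(h + dh) // 2, (w - dw) // 2:(w + dw) // 2]
    obstacle_term = -center.mean()               # penalize small central depth

    return w_dist * dist_term + w_depth * obstacle_term
```

Candidate trajectories with lower loss are ranked higher by the CEM planner sketched above.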

Real-World Experiments

We deploy our model on a physical robot to test its robustness in real-world environments.

Image-Goal
Language-Goal
Point-Goal

BibTeX

@misc{shen2026efficientmultimodalnavigationonestep,
      title={An Efficient and Multi-Modal Navigation System with One-Step World Model}, 
      author={Wangtian Shen and Ziyang Meng and Jinming Ma and Mingliang Zhou and Diyun Xiang},
      year={2026},
      eprint={2601.12277},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2601.12277}, 
}