LoTIS: Learning to Localize Reference Trajectories in Image-Space for Visual Navigation

Division of Robotics, Perception, and Learning at KTH Royal Institute of Technology

Abstract

We present LoTIS, a model for visual navigation that provides robot-agnostic image-space guidance by localizing a reference RGB trajectory in the robot's current view, without requiring camera calibration, poses, or robot-specific training. Instead of predicting actions tied to specific robots, we predict the image-space coordinates of the reference trajectory as they would appear in the robot's current view. This creates robot-agnostic visual guidance that integrates easily with local planning. Consequently, our model's predictions provide guidance zero-shot across diverse embodiments. By decoupling perception from action and learning to localize trajectory points rather than imitate behavioral priors, we enable a cross-trajectory training strategy that learns robust invariance to viewpoint and camera changes. We outperform state-of-the-art methods by 20-50 percentage points in success rate on forward navigation, and paired with a local planner we achieve a 94-98% success rate across diverse simulated and real-world environments. Furthermore, we achieve more than a 5x improvement on challenging tasks where baselines fail, such as backward traversal. The system is straightforward to use: we show that even a video from a handheld phone camera directly enables different robots to navigate to any point on the trajectory.

Method

LoTIS method overview

LoTIS decouples perception from action by predicting where a reference trajectory appears in the robot's current view, rather than predicting robot-specific actions. For each frame in the reference trajectory, our model outputs: (1) the 2D image coordinates where that frame's pose would appear in the current view, (2) whether it is visible, and (3) its normalized distance. This robot-agnostic representation interfaces directly with any local planner, enabling zero-shot transfer across embodiments, from drones to quadrupeds, using the same phone-recorded trajectory. A cross-trajectory training strategy, in which reference and query images come from different trajectories, teaches robustness to camera mismatch and enables backward traversal where prior methods fail.
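To make the interface to a local planner concrete, the following is a minimal sketch of how the three per-frame outputs could be consumed. It is not the LoTIS implementation: the `FramePrediction` container, the normalization of image coordinates to [0, 1], the thresholds, and the proportional steering rule are all assumptions introduced here for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FramePrediction:
    """Hypothetical container for the prediction of one reference frame."""
    u: float         # horizontal image coordinate, assumed normalized to [0, 1]
    v: float         # vertical image coordinate, assumed normalized to [0, 1]
    visible: float   # assumed to be a visibility probability in [0, 1]
    distance: float  # normalized distance to the reference pose


def select_subgoal(
    preds: List[FramePrediction],
    vis_threshold: float = 0.5,   # assumed cutoff, not from the paper
    max_distance: float = 0.5,    # assumed lookahead limit, not from the paper
) -> Optional[int]:
    """Return the index of the furthest-along reference frame that is
    predicted visible and within the lookahead distance, or None."""
    best = None
    for i, p in enumerate(preds):
        if p.visible >= vis_threshold and p.distance <= max_distance:
            best = i  # later frames overwrite earlier ones
    return best


def yaw_rate_from_subgoal(pred: FramePrediction, max_yaw_rate: float = 1.0) -> float:
    """Proportional steering toward the subgoal's horizontal image position.
    Positive values mean 'turn left' (an assumed sign convention)."""
    offset = 0.5 - pred.u  # positive when the subgoal is left of image center
    command = 2.0 * offset * max_yaw_rate
    return max(-max_yaw_rate, min(max_yaw_rate, command))


# Example with made-up predictions for three reference frames.
preds = [
    FramePrediction(u=0.50, v=0.6, visible=0.9, distance=0.1),
    FramePrediction(u=0.25, v=0.5, visible=0.8, distance=0.3),
    FramePrediction(u=0.90, v=0.5, visible=0.2, distance=0.6),  # not visible
]
idx = select_subgoal(preds)
if idx is not None:
    print(idx, yaw_rate_from_subgoal(preds[idx]))
```

Because the planner only sees image-space targets, the same predictions could in principle drive a different embodiment by swapping `yaw_rate_from_subgoal` for that robot's own controller, which is the decoupling the method describes.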

Kilometer-Scale Navigation

LoTIS scales to long-range outdoor trajectories.

BibTeX

@article{lotis2025,
  title={LoTIS: Learning to Localize Reference Trajectories in Image-Space for Visual Navigation},
  author={Anonymous},
  journal={Under Review},
  year={2025}
}