Topological Planning with Transformers for Vision-and-Language Navigation
- Kevin Chen Stanford University
- Junshen K. Chen Stanford University
- Jo Chuang Stanford University
- Marynel Vázquez Yale University
- Silvio Savarese Stanford University
Abstract
Conventional approaches to vision-and-language navigation (VLN) are trained end-to-end but struggle to perform well in freely traversable environments. Inspired by the robotics community, we propose a modular approach to VLN using topological maps. Given a natural language instruction and topological map, our approach leverages attention mechanisms to predict a navigation plan in the map. The plan is then executed with low-level actions (e.g. forward, rotate) using a robust controller. Experiments show that our method outperforms previous end-to-end approaches, generates interpretable navigation plans, and exhibits intelligent behaviors such as backtracking.
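To make the pipeline in the abstract concrete, the sketch below illustrates the general idea of attention-based planning over a topological map: instruction tokens are encoded with a transformer, map nodes attend to the encoded instruction, and a greedy routine follows the highest-scoring neighbors to form a plan. This is a minimal illustration under assumed names and dimensions (`TopoPlanner`, `plan_greedy`, feature sizes), not the paper's implementation.

```python
# Hypothetical sketch of attention-based planning over a topological map.
# All module names and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn


class TopoPlanner(nn.Module):
    """Cross-attention planner: map nodes attend to the encoded instruction."""

    def __init__(self, d_model=128, n_heads=4):
        super().__init__()
        self.instr_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            num_layers=2,
        )
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.score = nn.Linear(d_model, 1)  # per-node score for "next waypoint"

    def forward(self, instr_emb, node_emb):
        # instr_emb: (B, L, d) embedded instruction tokens
        # node_emb:  (B, N, d) visual embeddings of topological map nodes
        ctx = self.instr_enc(instr_emb)                    # contextualize instruction
        attended, _ = self.cross_attn(node_emb, ctx, ctx)  # nodes query the instruction
        return self.score(attended).squeeze(-1)            # (B, N) logits over nodes


def plan_greedy(planner, instr_emb, node_emb, adjacency, start, max_steps=10):
    """Greedily move to the best-scoring neighbor until no neighbor improves the score."""
    logits = planner(instr_emb, node_emb)[0]  # (N,)
    path, current = [start], start
    for _ in range(max_steps):
        neighbors = adjacency[current]
        best = max(neighbors, key=lambda n: logits[n].item(), default=None)
        if best is None or logits[best].item() <= logits[current].item():
            break  # no neighbor looks better than the current node -> stop
        path.append(best)
        current = best
    return path


if __name__ == "__main__":
    torch.manual_seed(0)
    B, L, N, d = 1, 12, 6, 128
    planner = TopoPlanner(d_model=d)
    instr = torch.randn(B, L, d)   # stand-in for embedded instruction
    nodes = torch.randn(B, N, d)   # stand-in for node (panorama) features
    adjacency = {0: [1, 2], 1: [0, 3], 2: [0, 4], 3: [1, 5], 4: [2], 5: [3]}
    print(plan_greedy(planner, instr, nodes, adjacency, start=0))
```

In a full system such as the one described above, the predicted node sequence would then be handed to a low-level controller that issues forward and rotate actions; that component is not sketched here.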
Links
- Agent-generated maps (ZIP, 685MB) [Download]
- Traversability maps (ZIP, 360KB) [Download]
- Code (contact kevin.chen@cs.stanford.edu)
Citation
Acknowledgements
This work was supported by ONR MURI Award #W911NF-15-1-0479. We also thank Stanford Vision and Learning Lab members for their constructive feedback, including Eric Chengshu Li and Claudia Perez D’Arpino.
The website template was borrowed from Michaël Gharbi and RelMoGen.