Topological Planning with Transformers
for Vision-and-Language Navigation

Overview

Abstract

Conventional approaches to vision-and-language navigation (VLN) are trained end-to-end but struggle to perform well in freely traversable environments. Inspired by the robotics community, we propose a modular approach to VLN using topological maps. Given a natural language instruction and topological map, our approach leverages attention mechanisms to predict a navigation plan in the map. The plan is then executed with low-level actions (e.g. forward, rotate) using a robust controller. Experiments show that our method outperforms previous end-to-end approaches, generates interpretable navigation plans, and exhibits intelligent behaviors such as backtracking.
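To make the modular pipeline above concrete, here is a minimal Python sketch (hypothetical names throughout, not the paper's implementation): a topological map, a planner stub that uses attention between the instruction and map nodes to predict a node sequence, and a controller that expands the plan into low-level actions. In the actual system, the plan is predicted by a transformer-based planner and executed by a robust controller; the greedy attention loop below only stands in for that learned component.

# Minimal sketch (hypothetical names, not the paper's code) of the modular pipeline:
# a topological map, an attention-based planner, and a controller that turns the
# predicted plan into low-level actions.
from dataclasses import dataclass
import math

@dataclass
class TopoMap:
    node_features: dict   # node_id -> feature vector (e.g. a visual embedding)
    edges: dict           # node_id -> list of traversable neighbor node_ids

def attention(query, keys):
    """Scaled dot-product attention weights of a single query over a list of keys."""
    d = math.sqrt(len(query))
    logits = [sum(q * k for q, k in zip(query, key)) / d for key in keys]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    return [e / sum(exps) for e in exps]

def plan(instruction_embedding, topo_map, start, max_steps=10):
    """Greedy stand-in for the learned planner: attend from the instruction over
    neighboring map nodes and move to the highest-scoring one at each step."""
    path = [start]
    for _ in range(max_steps):
        neighbors = topo_map.edges.get(path[-1], [])
        if not neighbors:
            break
        weights = attention(instruction_embedding,
                            [topo_map.node_features[n] for n in neighbors])
        path.append(neighbors[max(range(len(neighbors)), key=weights.__getitem__)])
    return path

def execute(path):
    """Toy controller: expand each edge of the plan into low-level actions."""
    actions = []
    for a, b in zip(path, path[1:]):
        actions += [f"rotate_towards({b})", f"forward({a}->{b})"]
    return actions

# Example: a three-node map where the instruction embedding matches "kitchen" best.
topo = TopoMap(
    node_features={"hall": [1.0, 0.0], "kitchen": [0.0, 1.0], "porch": [0.5, 0.5]},
    edges={"hall": ["kitchen", "porch"], "kitchen": [], "porch": []},
)
print(plan([0.0, 1.0], topo, start="hall", max_steps=3))   # ['hall', 'kitchen']
print(execute(["hall", "kitchen"]))                        # rotate, then move along the edge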

Links

Citation

Acknowledgements

This work was supported by ONR MURI Award #W911NF-15-1-0479. We also thank the members of the Stanford Vision and Learning Lab for their constructive feedback, including Eric Chengshu Li and Claudia Perez D’Arpino.
The website template was borrowed from Michaël Gharbi and RelMoGen.