VioLA: Aligning Videos to 2D LiDAR Scans

Jun-Jee Chao*, Selim Engin*, Nikhil Chavan-Dafle, Bhoram Lee, Volkan Isler

* indicates equal contribution

Samsung AI Center - New York


We introduce VioLA, a method for aligning user-captured videos (either RGB-D or posed RGB) to 2D LiDAR maps. VioLA allows augmenting LiDAR maps with semantics, as well as registering multiple independent videos to each other using the LiDAR map as a common coordinate frame.

With VioLA, we can register videos taken by different users, on different days, and at different locations. The animation above shows the registration process for several videos recorded with a smartphone.

VioLA Overview

VioLA aligns an RGB-D video (or an RGB video tagged with poses) recorded from a local section of a scene to the 2D LiDAR map of the entire environment. After registration, the LiDAR map can be augmented with texture, 3D geometry and semantics extracted from the videos.

VioLA first builds a semantic map of the local scene from the image sequence, then extracts points at a fixed height to register against the LiDAR map. Because of reconstruction errors or partial camera coverage, the reconstructed semantic map may not contain enough information for registration. To address this problem, VioLA uses a pre-trained text-to-image inpainting model paired with a depth completion model to fill in the missing scene content in a geometrically consistent way, which supports pose registration.
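The fixed-height extraction step can be illustrated with a minimal sketch. It assumes the reconstruction is an N x 3 NumPy array with z as the up axis and metres as units; the function name `slice_at_height` and the `tolerance` parameter are hypothetical and not taken from VioLA's code.

```python
import numpy as np


def slice_at_height(points, height, tolerance=0.05):
    """Keep points whose z value lies within `tolerance` of `height`
    and return only their (x, y) coordinates.

    Assumptions (not from the source): z is the up axis, units are metres.
    """
    points = np.asarray(points, dtype=float)
    mask = np.abs(points[:, 2] - height) <= tolerance
    return points[mask, :2]


# Toy reconstruction: one floor point, two points near 0.3 m, one high point.
cloud = np.array([
    [0.0, 0.0, 0.02],
    [1.0, 0.5, 0.30],
    [1.2, 0.4, 0.33],
    [2.0, 1.0, 1.50],
])

# 2D point set comparable to a planar LiDAR scan taken at about 0.3 m.
scan_like = slice_at_height(cloud, height=0.3, tolerance=0.05)
```

The resulting 2D point set is what a registration routine would then match against the LiDAR map. If too few points survive the slice, registration has little to work with, which is the gap the scene completion step is meant to fill.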

VioLA can be used to fuse 3D reconstructions obtained over multiple scans by registering the videos to the same LiDAR map.
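Using the LiDAR map as a common frame amounts to composing rigid transforms. The sketch below assumes each video has already been registered and that the result is a planar pose (x, y, heading) of the video's local frame in the map frame. The helper names `se2` and `transform_points` and the example poses are illustrative, not values from the paper.

```python
import numpy as np


def se2(x, y, theta):
    """3x3 homogeneous matrix for a planar rigid transform."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, x],
                     [s,  c, y],
                     [0.0, 0.0, 1.0]])


def transform_points(T, pts):
    """Apply a 3x3 homogeneous transform to an (N, 2) array of points."""
    pts = np.asarray(pts, dtype=float)
    return pts @ T[:2, :2].T + T[:2, 2]


# Hypothetical registration results: pose of each video's frame in the map.
T_map_a = se2(2.0, 1.0, np.pi / 2)
T_map_b = se2(5.0, -1.0, 0.0)

# Pose of video B's frame expressed in video A's frame.
T_a_b = np.linalg.inv(T_map_a) @ T_map_b

pts_b = np.array([[1.0, 0.0]])               # a point in video B's frame
pts_map = transform_points(T_map_b, pts_b)   # same point in the map frame
pts_a = transform_points(T_a_b, pts_b)       # same point in video A's frame
```

Because both videos are tied to the same map, neither needs to overlap the other visually: the relative pose follows from the two map registrations alone.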

The following video showcases an application of VioLA and demonstrates its capabilities:

We found that reconstructed points at the height of the LiDAR scan are critical for registration success. However, these points may be missing because the video does not capture the lower part of the scene, or because the SLAM algorithm struggles to match featureless points. To provide this missing information, we propose a strategy for selecting virtual viewpoints and a scene completion module that performs inpainting and 3D lifting from the chosen viewpoints. We evaluated VioLA on two real-world RGB-D benchmarks, as well as on a self-captured dataset of a large office scene. Notably, our proposed scene completion module improves pose registration performance by up to 20%.

In the animations below, we show 1) the reconstruction from an RGB-D image sequence taken from the Redwood dataset, 2) the completed point cloud using VioLA's scene completion module, which grounds the floor to the estimated floor surface, and 3) scene completion without floor grounding.
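To give a rough intuition for what floor grounding corrects, the sketch below uses a deliberately simple stand-in: it estimates the floor height of the reconstruction as a low percentile of z and shifts the completed points vertically so their floor matches. This is an assumption made for illustration and not the procedure used by VioLA; the function names, the percentile heuristic, and the toy data are all hypothetical.

```python
import numpy as np


def estimate_floor_height(points, percentile=2.0):
    """Crude floor estimate: a low percentile of the z coordinates.

    Assumes z is the up axis and the floor is roughly horizontal.
    """
    return float(np.percentile(points[:, 2], percentile))


def ground_to_floor(completed, floor_height, percentile=2.0):
    """Shift completed points along z so their estimated floor
    coincides with `floor_height`."""
    offset = floor_height - estimate_floor_height(completed, percentile)
    grounded = completed.copy()
    grounded[:, 2] += offset
    return grounded


# Toy data: the reconstruction has its floor near z = 0, while the
# lifted (completed) points float about 0.2 m above it.
reconstruction = np.array([[0.0, 0.0, 0.00],
                           [1.0, 0.0, 0.01],
                           [0.0, 1.0, 1.20]])
completed = np.array([[2.0, 0.0, 0.20],
                      [2.0, 1.0, 0.21],
                      [2.0, 2.0, 1.00]])

floor_h = estimate_floor_height(reconstruction)
grounded = ground_to_floor(completed, floor_h)
```

Without a correction of this kind, completed geometry can sit above or below the reconstructed floor, so a slice at the LiDAR height would cut through it at the wrong place.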

Dense reconstruction of an input video

Completed scene geometry with floor grounding

Completed scene geometry without grounding

VioLA Method Details

BibTeX

@inproceedings{,
    title={{VioLA: Aligning Videos to 2D LiDAR Scans}},
    author={Jun-Jee Chao and Selim Engin and Nikhil Chavan-Dafle and Bhoram Lee and Volkan Isler},
    booktitle={{arXiv pre-print: 2311.04783}},
    year={2023}
  }