RIC: Rotate-Inpaint-Complete for Generalizable Scene Reconstruction

Isaac Kasahara, Shubham Agrawal, Selim Engin, Nikhil Chavan-Dafle, Shuran Song, Volkan Isler

Samsung AI Center - New York

RIC takes in a single RGB-D image, rotates it, inpaints the missing areas, and completes the depth.

We introduce Rotate-Inpaint-Complete (RIC), a method for scene reconstruction that structurally breaks the problem into two steps: rendering novel views via inpainting, and 2D-to-3D scene lifting. Specifically, we leverage the generalization capability of large vision-language models (DALL-E 2) to inpaint the missing areas of scene color images rendered from different views. Next, we lift these inpainted images to 3D by predicting the normals of each inpainted image and solving for the missing depth values. By predicting normals instead of depth directly, our method is robust to changes in depth distribution and scale.

We show that our method outperforms multiple baselines while generalizing to novel objects and scenes. With rigorous quantitative evaluation on novel scenes containing multiple unknown objects and many instances of heavy occlusion, we show our method's ability to reconstruct both geometry and texture realistically.

Above we show qualitative results of our method on the HOPE dataset. We also release the code to demonstrate the usefulness of our approach and to support further research in the field of scene reconstruction.

Video

RIC Method Overview

RIC takes as input an RGB-D image and starts by rendering incomplete RGB-D images I_i and D_i from a new viewpoint T_i. The missing RGB values of I_i are inpainted using a diffusion-based vision-language model (VLM) given a generated prompt, such as “a photo of household objects on a table”, where the pixels to be inpainted are determined by our Surface-Aware Masking (SAM) technique. The inpainted image is used to predict surface normals and occlusion boundaries at the new viewpoint T_i, which are then used, together with the incomplete depth image D_i, to complete the missing depth values. After repeating this process for V viewpoints, the final output of RIC is a merge of the deprojected depth predictions.
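
The first step above, rendering an incomplete RGB-D pair from a new viewpoint, can be illustrated with a minimal NumPy sketch. This is not the released RIC code: the function names (deproject, render_from_view), the pinhole intrinsics matrix K, the convention that a depth of 0 marks a missing pixel, and the convention that T is a 4x4 transform from the original camera frame to the new one are all assumptions made for illustration. The returned hole mask is only the set of pixels that received no point; it is not the Surface-Aware Masking described below.

```python
import numpy as np

def deproject(depth, K):
    """Lift a depth image (H, W) to camera-frame points (N, 3).

    Assumption: depth == 0 marks a missing measurement.
    """
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]
    valid = depth > 0
    z = depth[valid]
    x = (u[valid] - K[0, 2]) * z / K[0, 0]
    y = (v[valid] - K[1, 2]) * z / K[1, 1]
    return np.stack([x, y, z], axis=1), valid

def render_from_view(rgb, depth, K, T):
    """Reproject an RGB-D image into a new camera pose.

    T is assumed to be a 4x4 transform from the original camera frame
    to the new camera frame. Returns (new_rgb, new_depth, holes), where
    holes is True for pixels that received no point.
    """
    h, w = depth.shape
    pts, valid = deproject(depth, K)
    colors = rgb[valid]

    # Move the points into the new camera frame, keep those in front of it.
    pts_new = pts @ T[:3, :3].T + T[:3, 3]
    front = pts_new[:, 2] > 1e-6
    pts_new, colors = pts_new[front], colors[front]

    # Pinhole projection to integer pixel coordinates.
    z = pts_new[:, 2]
    u = np.round(pts_new[:, 0] * K[0, 0] / z + K[0, 2]).astype(int)
    v = np.round(pts_new[:, 1] * K[1, 1] / z + K[1, 2]).astype(int)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    u, v, z, colors = u[inside], v[inside], z[inside], colors[inside]

    # Z-buffer: sort near to far, then keep the first point per pixel.
    order = np.argsort(z, kind="stable")
    flat = (v * w + u)[order]
    _, first = np.unique(flat, return_index=True)
    keep = order[first]

    new_depth = np.zeros((h, w), dtype=depth.dtype)
    new_rgb = np.zeros_like(rgb)
    new_depth[v[keep], u[keep]] = z[keep]
    new_rgb[v[keep], u[keep]] = colors[keep]
    return new_rgb, new_depth, new_depth == 0
```

With an identity T the sketch reproduces the input depth, while a small sideways translation leaves a band of empty pixels at the image border. The same deprojection, applied to each completed depth image and mapped back to the original frame, would give the point sets that are merged in the final step.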

Surface-Aware Masking (SAM)

Our Surface-Aware Masking (SAM) algorithm lets us map unknown 3D areas into 2D accurately. This makes the inpainting more accurate, which in turn lets us estimate the geometry and texture of the unknown objects in the scene.

Viewpoint Selection Method

We search for viewpoints by traversing directions along a sphere around the scene, moving away from the original viewpoint. Along each direction we evaluate candidate viewpoints using our context ratio (context pixels / all pixels) and use the viewpoint that gives the inpainting model enough context to work accurately while still providing new information. We do this for many different directions, and then apply our consistency filtering to obtain our final output.
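
The context-ratio test can be sketched as follows. This is an illustrative reading of the description above, not the released implementation: the function names, the min_ratio threshold of 0.7, and the rule of taking the farthest candidate along a direction that still meets the threshold are all assumptions. Each mask is a boolean image in which True marks a pixel to be inpainted, so the context pixels are the remaining ones.

```python
import numpy as np

def context_ratio(inpaint_mask):
    """Context pixels / all pixels, where True in the mask means 'to be inpainted'."""
    return 1.0 - float(np.mean(inpaint_mask))

def pick_viewpoint(masks_along_direction, min_ratio=0.7):
    """Pick a viewpoint along one search direction.

    masks_along_direction: inpainting masks for candidate viewpoints,
    ordered from nearest to farthest from the original view.
    Returns the index of the farthest candidate reached before the
    context ratio first drops below min_ratio (a hypothetical
    threshold), or None if even the nearest candidate fails.
    """
    chosen = None
    for i, mask in enumerate(masks_along_direction):
        if context_ratio(mask) >= min_ratio:
            chosen = i   # still enough context; try moving farther
        else:
            break        # too little context for reliable inpainting
    return chosen
```

Moving farther from the original view exposes more unseen surface, so the ratio typically falls along a direction; stopping at the last candidate above the threshold is one way to trade inpainting reliability against new information.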

Qualitative Novel View Results

Our method can also produce realistic novel views of multiple objects. Using the Rotation + SAM + Inpaint parts of our method, we show qualitative results of our method's ability to generate novel views from a single RGB-D image on the YCB-V dataset.
