RIC takes a single RGB-D image, rotates the viewpoint, inpaints the missing areas, and completes the depth.

We introduce Rotate-Inpaint-Complete (RIC), a method for scene reconstruction that structurally breaks the problem into two steps: rendering novel views via inpainting, and lifting the 2D views to 3D. Specifically, we leverage the generalization capability of large visual language models (DALL-E 2) to inpaint the missing areas of scene color images rendered from different views. Next, we lift these inpainted images to 3D by predicting surface normals for the inpainted image and solving for the missing depth values. By predicting normals instead of depth directly, our method is robust to changes in depth distribution and scale.
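
The lifting step can be illustrated with a small linear solve. Below is a minimal sketch of depth completion from a predicted normal map, assuming an orthographic projection so that a unit normal (nx, ny, nz) implies depth slopes dz/dx ≈ -nx/nz and dz/dy ≈ -ny/nz; the full method works in the perspective setting and also uses predicted occlusion boundaries, which this sketch omits. All function and parameter names here are illustrative, not the released API.

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import lsqr

def complete_depth(depth, normals, data_weight=10.0):
    # depth: (H, W) array with NaN where unknown; normals: (H, W, 3) unit vectors.
    H, W = depth.shape
    idx = np.arange(H * W).reshape(H, W)
    nz = np.clip(normals[..., 2], 1e-3, None)   # guard against division by zero
    gx = -normals[..., 0] / nz                  # implied slope of z along image x
    gy = -normals[..., 1] / nz                  # implied slope of z along image y

    rows, cols, vals, rhs = [], [], [], []
    eq = 0
    # Smoothness terms: z[y, x+1] - z[y, x] = gx and z[y+1, x] - z[y, x] = gy.
    for di, dj, g in ((0, 1, gx), (1, 0, gy)):
        a = idx[:H - di, :W - dj].ravel()
        b = idx[di:, dj:].ravel()
        n = len(a)
        rows += [np.arange(eq, eq + n), np.arange(eq, eq + n)]
        cols += [b, a]
        vals += [np.ones(n), -np.ones(n)]
        rhs.append(g[:H - di, :W - dj].ravel())
        eq += n
    # Data terms: strongly anchor the solution to the observed depth values.
    known = idx[~np.isnan(depth)]
    rows.append(np.arange(eq, eq + len(known)))
    cols.append(known)
    vals.append(np.full(len(known), data_weight))
    rhs.append(data_weight * depth[~np.isnan(depth)])
    eq += len(known)

    A = sp.csr_matrix(
        (np.concatenate(vals), (np.concatenate(rows), np.concatenate(cols))),
        shape=(eq, H * W))
    z = lsqr(A, np.concatenate(rhs))[0]         # sparse least-squares solve
    return z.reshape(H, W)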

We show that our method outperforms multiple baselines while generalizing to novel objects and scenes. Through rigorous quantitative evaluation on novel scenes containing multiple unknown objects under heavy occlusion, we demonstrate our method's ability to reconstruct both geometry and texture realistically.

Above, we show qualitative results of our method on the HOPE dataset. We also release our code to demonstrate the usefulness of our approach and to enable further research in the field of scene reconstruction.

Video





RIC Method Overview

RIC takes as input an RGB-D image and starts by rendering an incomplete color image Ii and depth image Di from a new viewpoint Ti. The missing RGB values of Ii are inpainted by a diffusion-based VLM given a generated prompt, such as “a photo of household objects on a table”, where the pixels to be inpainted are determined by our Surface-Aware Masking (SAM) technique. The inpainted image is used to predict surface normals and occlusion boundaries at the new viewpoint Ti, which, together with the incomplete depth image Di, are used to complete the missing depth values. After repeating this process for V viewpoints, the final output of RIC is a merge of the deprojected depth predictions.
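
To make the final step concrete, here is a minimal sketch of deprojecting and merging the completed depth maps, assuming each view i comes with a camera-to-world pose Ti (a 4x4 matrix) and shared pinhole intrinsics K. The helper names are ours, not the paper's.

import numpy as np

def deproject(depth, K):
    # Lift an (H, W) depth map to an (N, 3) point cloud in camera coordinates.
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    z = depth.ravel()
    x = (u.ravel() - K[0, 2]) * z / K[0, 0]
    y = (v.ravel() - K[1, 2]) * z / K[1, 1]
    pts = np.stack([x, y, z], axis=1)
    return pts[np.isfinite(z) & (z > 0)]        # drop missing or invalid depths

def merge_views(depths, poses, K):
    # Transform each per-view cloud into the world frame and concatenate.
    clouds = []
    for depth, T in zip(depths, poses):
        pts = deproject(depth, K)
        clouds.append(pts @ T[:3, :3].T + T[:3, 3])   # camera-to-world
    return np.concatenate(clouds, axis=0)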

Method



Our Surface-Aware Masking (SAM) algorithm

Our Surface-Aware Masking (SAM) algorithm accurately maps unknown 3D regions into 2D. This enables more accurate inpainting, which in turn lets us estimate the geometry and texture of the unknown objects in the scene.
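
As a rough illustration of the visibility bookkeeping involved, the sketch below projects the observed 3D points into the new viewpoint and marks every pixel they fail to cover as unknown, i.e. a pixel the inpainting model must fill. The paper's actual SAM algorithm reasons more carefully about which surfaces bound the unknown regions; this simplified point-splatting version and its names are our own.

import numpy as np

def surface_aware_mask(points_world, T_new, K, shape, radius=1):
    # points_world: (N, 3) points observed in the input view.
    # T_new: 4x4 world-to-camera pose of the new viewpoint; K: 3x3 intrinsics.
    H, W = shape
    pts = points_world @ T_new[:3, :3].T + T_new[:3, 3]   # into the new camera
    z = pts[:, 2]
    front = z > 1e-6                                      # keep points in front
    u = np.round(K[0, 0] * pts[front, 0] / z[front] + K[0, 2]).astype(int)
    v = np.round(K[1, 1] * pts[front, 1] / z[front] + K[1, 2]).astype(int)
    covered = np.zeros((H, W), dtype=bool)
    inside = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    covered[v[inside], u[inside]] = True
    # Dilate coverage a little so sparse point splats do not leave pinholes.
    for _ in range(radius):
        c = covered.copy()
        covered[1:] |= c[:-1]
        covered[:-1] |= c[1:]
        covered[:, 1:] |= c[:, :-1]
        covered[:, :-1] |= c[:, 1:]
    return ~covered    # True where the inpainting model must fill in content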


Viewpoint Selection Method

We search for viewpoints by traversing directions along a sphere around the scene, moving away from the original viewpoint. We evaluate candidate viewpoints with our context ratio (context pixels / all pixels) and select the viewpoint that gives the inpainting model enough context to work accurately while still providing new information. We repeat this for many different directions, then apply our consistency filtering to obtain the final output.
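
A minimal sketch of this selection loop, reusing the surface_aware_mask helper sketched above: the context ratio is the fraction of pixels with known content, and a candidate viewpoint is kept when that ratio falls inside a band that leaves the inpainter enough context while still exposing new geometry. The band edges below are illustrative placeholders, not the paper's tuned values.

def context_ratio(mask_unknown):
    # Fraction of pixels whose content is already known and can serve as
    # inpainting context.
    return 1.0 - mask_unknown.mean()

def select_viewpoints(candidate_poses, points_world, K, shape, lo=0.6, hi=0.95):
    # Keep poses whose context ratio is high enough to inpaint reliably but
    # low enough that the view still reveals unseen parts of the scene.
    kept = []
    for T in candidate_poses:
        ratio = context_ratio(surface_aware_mask(points_world, T, K, shape))
        if lo <= ratio <= hi:
            kept.append(T)
    return kept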




Qualitative Novel View Results

Our method can also produce realistic novel views of scenes with multiple objects. Using the Rotation + SAM + Inpaint stages of our method, we show qualitative results of its ability to generate novel views from a single RGB-D image on the YCB-V dataset.




BibTeX

@misc{kasahara2023ric,
      title={RIC: Rotate-Inpaint-Complete for Generalizable Scene Reconstruction},
      author={Isaac Kasahara and Shubham Agrawal and Selim Engin and Nikhil Chavan-Dafle and Shuran Song and Volkan Isler},
      year={2023},
      eprint={2307.11932},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}