Data-driven methods have gained popularity for addressing the monocular depth estimation (MDE) task. Among them, supervised methods, which yield state-of-the-art (SOTA) results, require large amounts of labeled training data, making ground-truth label collection costly. This work aims to enhance the performance of existing supervised learning-based MDE methods by generating a substantial number of virtual views as additional supervision signals, circumventing the laborious and time-consuming process of collecting extra data. We propose leveraging Neural Implicit Surface Reconstruction (NISR) techniques to augment an existing small-scale dataset by generating novel scene perspectives with corresponding high-quality depth maps. Experimental results demonstrate that the augmented dataset significantly boosts the performance of supervised MDE networks. This highlights the potential of NISR for scaling up small-scale datasets and offers a practical way to further improve existing supervised MDE models without an expensive label-collection process.
We first generate a small-scale dataset of RGB-depth pairs by rendering scenes from the Replica dataset.
Then, we obtain the normals corresponding to the RGB-D images using a pre-trained Omnidata model. Using the RGB images together with the depth and normal cues, we employ MonoSDF as the NISR method and reconstruct each scene in the small-scale dataset. After reconstruction, we generate novel views, i.e., virtual RGB-D images, for each scene from the trained MonoSDF models.
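The per-scene pipeline can be sketched as follows. This is a structural outline only: every function here (render_replica_scene, estimate_normals, fit_monosdf, render_virtual_views) is a hypothetical placeholder standing in for the real Replica renderer, the Omnidata model, and MonoSDF training/rendering, and the frame dicts stand in for actual RGB-D tensors.

```python
# Hypothetical sketch of the per-scene augmentation pipeline; all helper
# names and the view counts are placeholders, not real APIs from the paper.

def render_replica_scene(scene_id, n_views):
    # Placeholder: would render RGB-depth pairs from a Replica scene mesh.
    return [{"rgb": None, "depth": None, "pose": i} for i in range(n_views)]

def estimate_normals(frames):
    # Placeholder: would run the pre-trained Omnidata model on each RGB image.
    return [dict(f, normal=None) for f in frames]

def fit_monosdf(frames):
    # Placeholder: would optimize a MonoSDF model with RGB, depth, normal cues.
    return {"scene_model": "sdf", "n_train_views": len(frames)}

def render_virtual_views(model, n_virtual):
    # Placeholder: would sample novel camera poses and render RGB-D from the SDF.
    return [{"rgb": None, "depth": None, "virtual": True} for _ in range(n_virtual)]

def augment_scene(scene_id, n_views=100, n_virtual=400):
    frames = estimate_normals(render_replica_scene(scene_id, n_views))
    model = fit_monosdf(frames)
    return frames + render_virtual_views(model, n_virtual)

dataset = augment_scene("room_0")
print(len(dataset))  # 500
```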
The original data is combined with the virtual views to form large-scale datasets. We train the same MDE network on the original data, which yields the baseline results, and on the augmented datasets to verify the effectiveness of the proposed method.
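Forming the augmented training set amounts to concatenating the original frames with the virtual views and shuffling. A minimal stdlib sketch, where the frame dicts and the counts are illustrative stand-ins for real RGB-D samples:

```python
import random

# Stand-in frames; a real implementation would hold RGB images and depth maps.
original = [{"id": f"orig_{i}", "virtual": False} for i in range(100)]
virtual_views = [{"id": f"virt_{i}", "virtual": True} for i in range(400)]

# Augmented dataset = original data + virtual views, shuffled deterministically.
augmented = original + virtual_views
random.Random(0).shuffle(augmented)

print(len(augmented))                                 # 500
print(sum(f["virtual"] for f in augmented))           # 400
```

The same MDE network would then be trained once on `original` (baseline) and once on `augmented`, so any performance gap is attributable to the virtual views.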
The depth prediction results on the test set, evaluated using different metrics, are shown in the table below; they demonstrate the effectiveness of training the MDE network with virtual views as additional supervisory signals. An example of the qualitative results is shown in the figure at the beginning.
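For reference, depth metrics commonly used in MDE evaluation include absolute relative error (AbsRel), root mean squared error (RMSE), and the δ < 1.25 accuracy; the exact metric set used in the table is not restated here. A minimal sketch on toy predicted/ground-truth depth values:

```python
import math

def depth_metrics(pred, gt):
    """Standard MDE metrics on paired depth values (same length, gt > 0)."""
    n = len(pred)
    abs_rel = sum(abs(p - g) / g for p, g in zip(pred, gt)) / n
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / n)
    # delta1: fraction of pixels whose ratio max(p/g, g/p) is below 1.25
    delta1 = sum(max(p / g, g / p) < 1.25 for p, g in zip(pred, gt)) / n
    return abs_rel, rmse, delta1

pred = [1.0, 2.1, 3.0, 4.5]  # toy predicted depths (meters)
gt   = [1.0, 2.0, 3.2, 4.0]  # toy ground-truth depths (meters)
abs_rel, rmse, delta1 = depth_metrics(pred, gt)
print(round(abs_rel, 4))  # 0.0594
print(round(rmse, 4))     # 0.2739
print(delta1)             # 1.0
```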