
ZeroShape: The Training Dataset That We Used

by The FewShot Prompting Publication, December 31st, 2024

Too Long; Didn't Read

We evaluate on three real-world datasets: OmniObject3D, Ocrtoc3D, and Pix3D. For OmniObject3D, we render the 3D scans with Blender and HDR environment maps to generate realistic images with diverse lighting, randomly sampling camera viewpoint, distance, and focal length. For Pix3D, we follow the split of prior work and use 1181 images. We convert all three datasets into a unified, easy-to-use benchmark format.

Abstract and 1 Introduction

2. Related Work

3. Method and 3.1. Architecture

3.2. Loss and 3.3. Implementation Details

4. Data Curation

4.1. Training Dataset

4.2. Evaluation Benchmark

5. Experiments and 5.1. Metrics

5.2. Baselines

5.3. Comparison to SOTA Methods

5.4. Qualitative Results and 5.5. Ablation Study

6. Limitations and Discussion

7. Conclusion and References


A. Additional Qualitative Comparison

B. Inference on AI-generated Images

C. Data Curation Details

4.2. Evaluation Benchmark

We use three different real-world datasets for evaluation: OmniObject3D [66], Ocrtoc3D [51], and Pix3D [52]. Because our test images either come from the real world or are renders of real 3D object scans distinct from our training set, they form a good test set for zero-shot generalization.


OmniObject3D. OmniObject3D is a large and diverse dataset of 3D scans and videos of objects from 216 categories, including household objects and products, food, and toys. Because the foreground segmentations are noisy, we follow convention and render the 3D scans to generate test images [30, 31]. The default material shader produces a glass-like surface appearance, so we modify it to make the renders look more natural. We use Blender and HDR environment maps to generate realistic images with diverse lighting. We randomly sample camera viewpoint, distance, and focal length.
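The random camera sampling described above could be sketched roughly as follows. This is only an illustration, not the authors' pipeline: the function name `sample_camera`, the distance and focal-length ranges, and the elevation limit are all assumptions, since the text does not state the actual sampling ranges.

```python
import math
import random


def sample_camera(rng, dist_range=(1.0, 2.5), focal_range_mm=(35.0, 85.0),
                  max_elev_deg=60.0):
    """Sample one camera looking at an object centered at the origin.

    All ranges are hypothetical placeholders, not values from the paper.
    """
    # Azimuth: uniform around the object.
    azimuth = rng.uniform(0.0, 2.0 * math.pi)
    # Elevation: uniform in sin(elevation), which is area-uniform on the sphere band.
    sin_max = math.sin(math.radians(max_elev_deg))
    elevation = math.asin(rng.uniform(-sin_max, sin_max))
    # Distance to the object and focal length: uniform within the assumed ranges.
    distance = rng.uniform(*dist_range)
    focal_length_mm = rng.uniform(*focal_range_mm)
    # Convert spherical coordinates to a Cartesian camera position (z is up).
    position = (
        distance * math.cos(elevation) * math.cos(azimuth),
        distance * math.cos(elevation) * math.sin(azimuth),
        distance * math.sin(elevation),
    )
    return {
        "position": position,
        "azimuth": azimuth,
        "elevation": elevation,
        "distance": distance,
        "focal_length_mm": focal_length_mm,
    }


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sampled views are reproducible
    for _ in range(3):
        print(sample_camera(rng))
```

Seeding the generator keeps a rendered test set reproducible, which matters when the same views are to be reused across methods.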


Ocrtoc3D. Ocrtoc3D is a real-world object dataset that contains object-centric videos and full 3D annotations from 15 coarse categories. Some coarse categories contain many subcategories (e.g. toy animals contain various species). For each video the mesh (3D scan) and the viewpoint information are provided. We clean up this dataset by manually removing outliers (e.g. empty meshes/wrong object scales) and use the full filtered dataset consisting of 749 unique image-object pairs.
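The cleanup above was done manually. As a rough illustration only, an automated pre-screen could flag candidates for such a manual review. The function `flag_mesh`, the bounding-box thresholds, and the assumption that vertex coordinates are in meters are all hypothetical and not taken from the paper.

```python
import math


def flag_mesh(vertices, min_diag=0.02, max_diag=2.0):
    """Return a reason string if a mesh looks suspicious, else None.

    `vertices` is a sequence of (x, y, z) tuples. The thresholds are
    hypothetical bounding-box diagonals, assuming coordinates in meters.
    """
    if not vertices:
        return "empty mesh"
    # Axis-aligned bounding box of the mesh.
    mins = [min(v[i] for v in vertices) for i in range(3)]
    maxs = [max(v[i] for v in vertices) for i in range(3)]
    diag = math.dist(mins, maxs)
    if not (min_diag <= diag <= max_diag):
        return f"suspicious scale (bbox diagonal {diag:.3f})"
    return None


if __name__ == "__main__":
    meshes = {
        "ok_object": [(0.0, 0.0, 0.0), (0.1, 0.1, 0.1)],
        "empty_object": [],
        "huge_object": [(0.0, 0.0, 0.0), (100.0, 0.0, 0.0)],
    }
    for name, verts in meshes.items():
        print(name, "->", flag_mesh(verts))
```

A check like this only narrows down what a person needs to look at; it does not replace the manual inspection.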


Pix3D. Pix3D is a real-world object dataset that contains 3D annotations from 9 categories. For each image in this dataset, an object mask, a CAD model, and the input viewpoint information are provided. These 3D annotations come from manual alignment between shapes and images. We follow the split of [19] and use 1181 images.


Figure 5. Qualitative results. We compare ZeroShape to other SOTA methods on our curated benchmark (first three columns are from Ocrtoc3D [51], last three are from OmniObject3D [66]). Our reconstruction not only better aligns with the visible surfaces from images, but also recovers a faithful global structure of the reconstructed objects.


Benchmark curation. To create an easy-to-use benchmark, we convert the three heterogeneous datasets into a unified format. This includes aligning and converting the camera intrinsics and extrinsics, and object poses, to a standardized convention across the test datasets and our synthetic dataset. This is often a tedious obstacle in 3D vision research. We also organize images, masks and other metadata in a standardized manner. The release of our training data, data generating pipeline, and benchmark will benefit the community by providing a unified setup for large scale training on synthetic data and large scale testing on real data.
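To give a sense of what converting cameras to a standardized convention involves, here is a minimal sketch of two common steps. The text does not say which conventions the benchmark uses, so the OpenCV-to-OpenGL direction, the function names, and the sensor width in the usage example are assumptions chosen for illustration.

```python
def opencv_to_opengl_w2c(w2c):
    """Convert a 4x4 world-to-camera matrix between camera-axis conventions.

    OpenCV cameras use x right, y down, z forward; OpenGL/Blender cameras use
    x right, y up, z backward. Switching between them negates the camera's
    y and z axes, i.e. rows 1 and 2 of the world-to-camera matrix.
    """
    flip = (1.0, -1.0, -1.0, 1.0)
    return [[flip[i] * v for v in row] for i, row in enumerate(w2c)]


def focal_mm_to_pixels(focal_mm, sensor_width_mm, image_width_px):
    """Convert a focal length in millimeters to pixels along the image width."""
    return focal_mm * image_width_px / sensor_width_mm


def transform_point(mat, point):
    """Apply a 4x4 rigid transform to a 3D point."""
    x, y, z = point
    return tuple(r[0] * x + r[1] * y + r[2] * z + r[3] for r in mat[:3])


if __name__ == "__main__":
    identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    gl = opencv_to_opengl_w2c(identity)
    # A point 2 units in front of an OpenCV camera has negative z in OpenGL.
    print(transform_point(gl, (0.0, 0.0, 2.0)))
    # Assumed 36 mm sensor width and 720 px image width.
    print(focal_mm_to_pixels(50.0, 36.0, 720))
```

The axis flip is its own inverse, so the same function converts in either direction. Storing every dataset in one agreed convention means this kind of per-dataset conversion has to be written only once.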


This paper is available on arXiv under a CC BY 4.0 DEED license.

Authors:

(1) Zixuan Huang, University of Illinois at Urbana-Champaign (equal contribution);

(2) Stefan Stojanov, Georgia Institute of Technology (equal contribution);

(3) Anh Thai, Georgia Institute of Technology;

(4) Varun Jampani, Stability AI;

(5) James M. Rehg, University of Illinois at Urbana-Champaign.