
ZeroShape: Here's How We Did Our Data Curation

by The FewShot Prompting Publication, December 31st, 2024

Too Long; Didn't Read

We use all 55 categories of ShapeNetCore.v2 and over 1000 categories from the Objaverse-LVIS subset. For ShapeNet, we generate 1 to 20 images per object, scaled inversely with the number of meshes in the object's category. We use Blender to generate synthetic images from the 3D meshes and to extract a variety of useful annotations.

Abstract and 1 Introduction

2. Related Work

3. Method and 3.1. Architecture

3.2. Loss and 3.3. Implementation Details

4. Data Curation

4.1. Training Dataset

4.2. Evaluation Benchmark

5. Experiments and 5.1. Metrics

5.2. Baselines

5.3. Comparison to SOTA Methods

5.4. Qualitative Results and 5.5. Ablation Study

6. Limitations and Discussion

7. Conclusion and References


A. Additional Qualitative Comparison

B. Inference on AI-generated Images

C. Data Curation Details

4. Data Curation

4.1. Training Dataset

We use all 55 categories of ShapeNetCore.v2 [6], for a total of about 52K meshes, as well as over 1000 categories from the Objaverse-LVIS [9] subset. This subset of Objaverse has been manually filtered by crowd workers to primarily include meshes of objects rather than other assets such as scans of large scenes and buildings. After filtering Objaverse-LVIS to remove objects with minimal geometry (e.g., objects consisting of a single plane), this dataset has 42K meshes. Pooling these two data sources gives us a total of over 90K 3D object meshes from over 1000 categories.
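
The text does not say how "minimal geometry" is detected. As a rough illustration only, one possible check is whether all vertices of a mesh lie in a single plane. The function names `is_planar` and `filter_meshes` and the tolerance below are assumptions for this sketch, not the authors' filtering code.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _norm(a):
    return math.sqrt(_dot(a, a))


def is_planar(vertices, rel_tol=1e-6):
    """Return True if all vertices lie (nearly) in one plane.

    Hypothetical criterion for "minimal geometry"; rel_tol is an assumed
    tolerance relative to the extent of the mesh.
    """
    if len(vertices) < 4:
        return True  # fewer than four points are always coplanar
    origin = vertices[0]
    offsets = [_sub(v, origin) for v in vertices]
    scale = max(_norm(o) for o in offsets)
    if scale == 0.0:
        return True  # all vertices coincide
    # Longest offset from the first vertex serves as one in-plane direction.
    d1 = max(offsets, key=_norm)
    # Candidate plane normal: the largest cross product with that direction.
    normal = max((_cross(d1, o) for o in offsets), key=_norm)
    n_len = _norm(normal)
    if n_len <= rel_tol * scale * scale:
        return True  # all vertices are (nearly) collinear
    # Planar if every vertex is within tolerance of the candidate plane.
    return all(abs(_dot(normal, o)) / n_len <= rel_tol * scale for o in offsets)


def filter_meshes(meshes):
    """Keep only meshes (name -> vertex list) that are not planar."""
    return {name: verts for name, verts in meshes.items() if not is_planar(verts)}


# Toy data: a flat quad is dropped, a unit cube is kept.
quad = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]
cube = [(x, y, z) for x in (0, 1) for y in (0, 1) for z in (0, 1)]
kept = filter_meshes({"quad": quad, "cube": cube})
```

A real pipeline would likely use additional criteria (for example face count or bounding-box thickness); the planarity test here only covers the single-plane example mentioned in the text.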


We use Blender [41] to generate synthetic images from the 3D meshes and to extract a variety of useful annotations: depth maps, camera intrinsics, and object and camera pose. Because the object distribution of ShapeNet is highly skewed (7 categories account for 67% of the data), we generate 1 to 20 images per object, scaled inversely with the number of meshes in the object's category, resulting in a total of 159K images. For Objaverse, we generate 25 images per object, resulting in 939K images. Our training set consists of slightly less than 1.1M images.
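
The exact scaling rule is not given here, so the following is only a minimal sketch of one way to assign a per-object image count that shrinks as the category grows, clamped to the stated 1 to 20 range. The function name `images_per_object`, the `budget` constant, and the toy category sizes are all assumptions for illustration.

```python
def images_per_object(category_size: int, budget: float = 2000.0,
                      min_images: int = 1, max_images: int = 20) -> int:
    """Number of renders per object, inversely proportional to category size.

    `budget` is a hypothetical constant: roughly the number of images
    targeted per category before clamping to [min_images, max_images].
    """
    if category_size <= 0:
        raise ValueError("category_size must be positive")
    raw = round(budget / category_size)
    return max(min_images, min(max_images, raw))


# Toy category sizes (not the real ShapeNet counts).
toy_sizes = {"rare_category": 70, "medium_category": 400, "common_category": 8000}
toy_counts = {name: images_per_object(n) for name, n in toy_sizes.items()}
# Rare categories hit the upper clamp, very common ones the lower clamp.
```

With this kind of rule, objects from dominant categories contribute few images each, which flattens the skew. As a consistency check on the reported totals, 159K + 939K = 1,098K images, which matches "slightly less than 1.1M".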


We generate images with varying focal lengths, from 30mm to 70mm (35mm-equivalent sensor size). We generate diverse object-camera geometry: rather than the common approach of always pointing the camera at the middle of the object from a fixed distance, we vary both the object-camera distance and the LookAt point of the camera. This allows us to capture a wide range of variability in how 3D shape projects to 2D. We follow the convention of using center-cropped and foreground-segmented images for training and testing. We provide more details in the supplement.


This paper is available on arXiv under the CC BY 4.0 DEED license.

Authors:

(1) Zixuan Huang, University of Illinois at Urbana-Champaign (equal contribution);

(2) Stefan Stojanov, Georgia Institute of Technology (equal contribution);

(3) Anh Thai, Georgia Institute of Technology;

(4) Varun Jampani, Stability AI;

(5) James M. Rehg, University of Illinois at Urbana-Champaign.