Superquadrics Revisited:
Learning 3D Shape Parsing beyond Cuboids

Despoina Paschalidou 1,4 Ali Osman Ulusoy 2 Andreas Geiger 1,3,4
1 Autonomous Vision Group, MPI for Intelligent Systems Tübingen 2 Microsoft
3 University of Tübingen 4 Max Planck ETH Center for Learning Systems
CVPR 2019

Paper

Code

Video

Poster

Blog
Superquadrics allow modeling fine details, such as the tails and ears of animals, the wings and bodies of airplanes, and the wheels of motorbikes, which are hard to capture using cuboids.
Abstracting complex 3D shapes with parsimonious part-based representations has been a long-standing goal in computer vision. This paper presents a learning-based solution to this problem which goes beyond the traditional 3D cuboid representation by exploiting superquadrics as atomic elements. We demonstrate that superquadrics lead to more expressive 3D scene parses while being easier to learn than 3D cuboid representations. Moreover, we provide an analytical solution to the Chamfer loss which avoids the need for computationally expensive reinforcement learning or iterative prediction. Our model learns to parse 3D objects into consistent superquadric representations without supervision. Results on various ShapeNet categories as well as the SURREAL human body dataset demonstrate the flexibility of our model in capturing fine details and complex poses that could not have been modelled using cuboids.
Approach Overview
We propose a novel deep neural network that efficiently learns to parse 3D objects into consistent superquadric representations, without any part-level supervision, conditioned on a 3D shape or 2D image as input. In particular, our network encodes the input image/shape into a low-dimensional primitive representation \(\mathbf{P}=\{(\lambda_m, \gamma_m)\}_{m=1}^M\), where \(M\) is an upper bound on the number of primitives. For each primitive, the network regresses a set of parameters \(\lambda_m\) that determine its shape, size and pose, together with a probability of existence \(\gamma_m\).

We represent the target pointcloud as a set of 3D points \(\mathbf{X}=\{\mathbf{x}_i\}_{i=1}^N\) and approximate the surface of each primitive \(m\) with a set of points \(\mathbf{Y}_m = \{\mathbf{y}_k^m\}_{k=1}^K\) uniformly sampled on its surface. To train our network, we measure the discrepancy between the target and the predicted shape. In particular, we formulate our optimization objective as a bi-directional reconstruction loss and incorporate a minimum description length prior, which favors parsimony. The overall loss is: $$ \begin{equation*} \mathcal{L}_{D}(\mathbf{P} , \mathbf{X}) = \underbrace{\mathcal{L}_{P\rightarrow X}(\mathbf{P}, \mathbf{X})}_{ \substack{\text{Primitive-to-Pointcloud}}} + \underbrace{\mathcal{L}_{X\rightarrow P}(\mathbf{X}, \mathbf{P})}_{ \substack{\text{Pointcloud-to-Primitive}}} + \underbrace{\mathcal{L}_{\gamma}(\mathbf{P})}_{ \substack{\text{Parsimony}}} \end{equation*} $$
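To make the primitive parameterization concrete, below is a minimal NumPy sketch of the standard superquadric surface equation, with size parameters \(\alpha\) and shape parameters \(\epsilon\). The function names are ours, and sampling uniformly in the angular parameters is only an approximation of uniform surface sampling:

```python
import numpy as np

def fexp(x, p):
    """Signed exponentiation sign(x) * |x|**p, which keeps the surface closed."""
    return np.sign(x) * (np.abs(x) ** p)

def sample_superquadric(alpha, epsilon, n_eta=50, n_omega=50):
    """Sample points on a superquadric surface in its canonical frame.

    alpha:   (3,) size parameters along the x, y, z axes
    epsilon: (2,) shape parameters controlling the cross-sections
    """
    eta = np.linspace(-np.pi / 2, np.pi / 2, n_eta)   # latitude angle
    omega = np.linspace(-np.pi, np.pi, n_omega)       # longitude angle
    eta, omega = np.meshgrid(eta, omega)

    x = alpha[0] * fexp(np.cos(eta), epsilon[0]) * fexp(np.cos(omega), epsilon[1])
    y = alpha[1] * fexp(np.cos(eta), epsilon[0]) * fexp(np.sin(omega), epsilon[1])
    z = alpha[2] * fexp(np.sin(eta), epsilon[0])
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

# epsilon near (1, 1) yields an ellipsoid; values near 0 approach a cuboid
points = sample_superquadric(alpha=np.array([1.0, 0.7, 0.4]),
                             epsilon=np.array([0.3, 1.0]))
```

Varying \(\epsilon\) continuously interpolates between cuboid-like, ellipsoidal and octahedral shapes, which is what makes superquadrics strictly more expressive than cuboids under the same parameter budget.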
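The loss above can be sketched as follows. This is a simplified, hypothetical NumPy version for illustration only: the paper derives an analytical expectation of the Chamfer distance over primitive existence, which we omit here, and the `parsimony_weight` hyperparameter is our own placeholder:

```python
import numpy as np

def pairwise_dist(a, b):
    """Squared Euclidean distances between point sets a (N, 3) and b (K, 3)."""
    return ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)

def reconstruction_loss(X, Y, gamma, parsimony_weight=1e-3):
    """Simplified bi-directional loss between target points X (N, 3) and
    primitive surface points Y (a list of M arrays of shape (K, 3)),
    with existence probabilities gamma (M,)."""
    # Primitive-to-pointcloud: points on each primitive should lie close to
    # the target, weighted by the primitive's probability of existence.
    l_p2x = sum(g * pairwise_dist(Ym, X).min(axis=1).mean()
                for Ym, g in zip(Y, gamma))

    # Pointcloud-to-primitive: every target point should be explained by the
    # closest primitive (the paper additionally takes an expectation over
    # which primitives exist, which we skip here).
    d = np.stack([pairwise_dist(X, Ym).min(axis=1) for Ym in Y], axis=1)  # (N, M)
    l_x2p = d.min(axis=1).mean()

    # Parsimony prior: penalize the expected number of active primitives.
    l_parsimony = parsimony_weight * gamma.sum()
    return l_p2x + l_x2p + l_parsimony
```

Because every term is a differentiable function of the sampled surface points and the existence probabilities, the network can be trained end-to-end with gradient descent, without reinforcement learning or iterative prediction.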
Are superquadrics better than cuboids?
We visualize the qualitative evolution of superquadrics (top) and cuboids (bottom) during training. Superquadrics converge faster to more accurate representations, whereas cuboids cannot capture details such as the open mouth of the dog, even after convergence. This is also validated quantitatively: for any given number of primitives, superquadrics consistently achieve a lower loss and hence a higher modeling fidelity.
Evaluation on ShapeNet
We evaluate the quality of the predicted primitives on the ShapeNet dataset. We train one model per object category using at most 20 primitives. We associate every primitive with a unique color, so primitives illustrated with the same color correspond to the same object part.
We visualize predictions for the object categories animals, aeroplanes and chairs from the ShapeNet dataset. The top row illustrates the ground-truth mesh for every object. The middle row depicts the corresponding predictions using the cuboidal primitives estimated by Tulsiani et al. The bottom row shows the corresponding predictions using our learned superquadric surfaces. We observe that the predicted primitive representations are consistent across instances. For example, the primitive depicted in green describes the right wing of the aeroplane, while for the animals class, the yellow primitive describes the front legs of the animal.
Evaluation on SURREAL
Our network learns semantic mappings of body parts across different body shapes and articulations. The benefits of superquadrics over simpler shape abstractions are accentuated in this dataset due to the complicated shapes of the human body. Our model predicts pointy octahedral shapes for the feet, ellipsoid shapes for the head and a flattened elongated superellipsoid for the main body, without any supervision on the primitive parameters. Another interesting aspect of our model is the consistency of the predicted primitives, i.e., the same primitives (highlighted with the same color) consistently represent the feet, legs, arms etc. across different poses.
Acknowledgements
We thank Michael Black for early discussions on superquadrics. This research was supported by the Max Planck ETH Center for Learning Systems.