Humans develop a common-sense understanding of the physical behaviour of the world within the first year of their life. For example, we are able to identify 3D objects in a scene and infer their geometric and physical properties, even when only parts of these objects are visible. It has long been hypothesized that the human visual system processes the vast amount of raw visual input into compact, parsimonious representations, where complex objects are decomposed into a small number of atomic elements (primitives) that can each be represented using low-dimensional descriptions. In the early days of computer vision, researchers explored various shape primitives that could potentially mimic human perception, such as 3D polyhedra, generalized cylinders, geons and superquadrics. However, it proved very difficult to extract such representations from images due to the lack of computational resources and training data at that time.
Recently, shape primitives have been revisited in the context of deep learning. Primitive-based representations provide an interpretable alternative to traditional shape extraction methods, which do not take into consideration the constituent parts of the target object.