We propose a motion-based method to discover the physical parts of an articulated object class (e.g. head/torso/leg of a horse) from multiple videos. The key is to find object regions that exhibit consistent motion relative to the rest of the object, across multiple videos. We can then learn a location model for the parts and segment them accurately in the individual videos using an energy function that also enforces temporal and spatial consistency in part motion. Reasoning across videos allows us to share information: e.g. we can discover the legs from videos of tigers walking and transfer them to videos of tigers just turning their head (and vice versa). Further, it establishes correspondences across videos: note how our method labeled the corresponding parts in the two videos below with the same color (e.g. head in brown, torso in white). This work is discussed in detail in our CVPR 2016 paper. It is also mentioned in this TechCrunch article.
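To give a flavor of the segmentation step, here is a minimal sketch of labeling superpixels with part labels by minimizing an energy with a unary (motion-consistency) term and a spatial-smoothness (Potts) term over adjacent superpixels. It uses iterated conditional modes as a simple stand-in for the full optimization in the paper; the function names, cost arrays, and solver choice are illustrative assumptions, not our actual implementation.

```python
import numpy as np

def label_parts(unary, edges, pairwise_weight=1.0, iters=10):
    """Assign a part label to each superpixel.

    unary : (n_superpixels, n_parts) array; unary[i, p] is the cost of
            giving superpixel i the part label p (e.g. how inconsistent
            its motion is with part p's motion model).
    edges : list of (i, j) pairs of spatially adjacent superpixels.

    Minimizes unary cost plus a Potts penalty for adjacent superpixels
    with different labels, via iterated conditional modes (ICM).
    """
    n, k = unary.shape
    labels = unary.argmin(axis=1)          # initialize from unary costs alone
    neighbors = [[] for _ in range(n)]
    for i, j in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)
    for _ in range(iters):
        changed = False
        for i in range(n):
            # cost of each candidate label = unary + smoothness disagreement
            cost = unary[i].copy()
            for j in neighbors[i]:
                cost = cost + pairwise_weight * (np.arange(k) != labels[j])
            best = int(cost.argmin())
            if best != labels[i]:
                labels[i] = best
                changed = True
        if not changed:                    # converged to a local minimum
            break
    return labels
```

For example, four superpixels in a chain whose motion costs favor two parts would come out as two coherent regions, `[0, 0, 1, 1]`.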
We provide here the dataset and the ground-truth annotations used in this work. It consists of 16 videos of tigers and 16 of horses. For several frames in each video, we manually annotated the 2D location of the physical parts as illustrated below. These annotations are at the superpixel level. They can be very useful for evaluating methods for motion segmentation or structure from motion. If you use this data, please cite our CVPR 2016 paper.
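Since the annotations are per-superpixel part labels, a natural evaluation is per-part intersection-over-union between a predicted labeling and the ground truth. The sketch below assumes both labelings are integer arrays with one entry per annotated superpixel and, for brevity, ignores superpixel areas (a real evaluation would weight each superpixel by its pixel count); the function name and array layout are illustrative, not the dataset's actual file format.

```python
import numpy as np

def per_part_iou(pred, gt, n_parts):
    """Per-part IoU between two superpixel labelings.

    pred, gt : equal-length integer arrays, one part label per superpixel.
    Returns a list with one IoU per part (NaN if the part is absent
    from both labelings).
    """
    pred, gt = np.asarray(pred), np.asarray(gt)
    ious = []
    for p in range(n_parts):
        inter = np.logical_and(pred == p, gt == p).sum()
        union = np.logical_or(pred == p, gt == p).sum()
        ious.append(inter / union if union else float("nan"))
    return ious
```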
DOWNLOAD (429 MB)
The video below provides an overview of our method.
This work was partly funded by a Google Faculty Award, and by ERC Grant “Visual Culture for Image Understanding”.