Robust Computer Vision through Approximate-Analysis-by-Synthesis with Compositional Generative Networks
Adam Kortylewski, Postgraduate Researcher, Johns Hopkins University, Baltimore, USA
22 April 2021 | 13:00 BST
A critical problem for computer vision is that current machine learning approaches perform well in scenarios that resemble their training data, but fail to give reliable predictions in unseen or adverse viewing conditions. They are unreliable in real-world scenarios where objects are partially occluded, seen from a previously unseen pose, or viewed in bad weather. This lack of robustness needs to be overcome to make computer vision a reliable component of science and our everyday lives.
It has long been conjectured that vision should be formulated in terms of analysis-by-synthesis (ABS), which inverts the image formation process and hence estimates the 3D structure of the physical world. Nevertheless, computer vision researchers have found it very difficult to develop ABS methods that work reliably on real-world images. In this talk, I argue that the main limitation of past ABS approaches is that they aim to be generative at the level of image intensities, and hence model many object details that are irrelevant for most high-level vision tasks such as classification, detection, or pose estimation. I will introduce a new class of generative vision models that operate on neural network features instead of image intensities. These models, which we term Compositional Generative Networks (CGNs), avoid modeling irrelevant object details and hence lead to an approximate-analysis-by-synthesis approach to computer vision. CGNs are simplified models that enable efficient learning and inference. They perform as well as classic deep networks on standard vision tasks but significantly outperform them on out-of-distribution data, for example when objects are partially occluded or seen from a previously unobserved viewpoint. I believe that approximate ABS has several potential benefits beyond robust vision, for example in enabling efficient learning, multi-tasking, and top-down reasoning.
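To make the idea of a generative model over features (rather than pixels) concrete, here is a minimal toy sketch. It is not the CGN model from the talk: the isotropic Gaussian likelihoods, the outlier model centred at zero, the 2-D feature vectors, and all names (`log_gauss`, `robust_score`, `classify`, `templates`, `observed`) are assumptions made for illustration. Each class has a template of expected feature vectors, and each observed position is explained either by the class template or by a broad outlier model, so occluded positions cannot dominate the score.

```python
import math

def log_gauss(x, mean, sigma):
    """Log density of an isotropic Gaussian at feature vector x."""
    d = len(x)
    sq = sum((a - b) ** 2 for a, b in zip(x, mean))
    return -0.5 * sq / sigma ** 2 - d * math.log(sigma) - 0.5 * d * math.log(2 * math.pi)

def robust_score(features, template, sigma_obj=0.5, sigma_out=3.0):
    """Score a class template against observed features.

    Each position takes the better of two explanations: the object
    template (narrow Gaussian) or a generic outlier model (broad
    Gaussian around zero, an assumption of this toy example).
    Returns the total log score and a per-position mask that is True
    where the object model gave the better explanation.
    """
    total, explained = 0.0, []
    for f, t in zip(features, template):
        obj = log_gauss(f, t, sigma_obj)
        out = log_gauss(f, [0.0] * len(f), sigma_out)
        explained.append(obj >= out)
        total += max(obj, out)
    return total, explained

def classify(features, templates):
    """Pick the class whose template best explains the features."""
    scores = {name: robust_score(features, t) for name, t in templates.items()}
    best = max(scores, key=lambda name: scores[name][0])
    return best, scores[best][1]

# Hypothetical feature templates for two classes (4 positions, 2-D features).
templates = {
    "A": [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]],
    "B": [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0]],
}

# Class "A" features with the last two positions replaced by occluder features.
observed = [[1.0, 0.1], [0.9, 0.0], [-3.0, -3.0], [-3.0, 3.0]]

label, mask = classify(observed, templates)
print(label, mask)
```

In this toy setting the mask doubles as a crude occlusion estimate: positions where the outlier model wins are treated as not belonging to the object. Real feature maps are high-dimensional and spatially structured, so this only illustrates the principle.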