The Model 2.0:
An Anatomically-Inspired Model of the Primate Ventral Stream

Thursday 30 June 2022

Garrison Cottrell, University of California, San Diego

Over the last thirty or so years, my lab has used variants of a relatively simple biologically-inspired neurocomputational model of face and object recognition (The Model™) to explain a number of behavioral, developmental, and neurophysiological phenomena. These results include, for example, fits to data supporting both the categorical and continuous theories of facial expression perception (“one model to rule them all”), a novel explanation of hemispheric asymmetries (local and global perception of hierarchical stimuli), and my favorite result, why the fusiform face area is recruited for other domains of visual expertise. Here, I report on some results of The Model 2.0, a deep version of The Model that includes a foveated retina, the log-polar mapping from the visual field to V1, sampling from the image via a salience map, and dual pathways from V1, central and peripheral. First, I describe some previously reported results on how The Model 2.0 can explain behavioral data in human scene perception under scotoma and tunnel vision conditions (Wang & Cottrell, 2017). Second, I provide a novel explanation of the face inversion effect. Contrary to the generally accepted wisdom that this occurs deep in the visual stream, our hypothesis is that the face inversion effect can be accounted for by the representation in V1 combined with the reliance on the configuration of features in face recognition.

The log-polar mapping, when used as input to a convolutional neural network (CNN), provides two kinds of invariance. A change in scale becomes just a left-right shift in this representation (see the images of Geoff Hinton in the top row and their log-polar representations in the bottom row). Similarly, rotation in the image plane becomes an up-down shift. Because CNNs are (somewhat) translation invariant, the network as a whole becomes scale and rotation invariant. However, invariance to translation in the original image is lost. We make up for this by sampling from the image at multiple points, just as humans use multiple fixations to recognize a face (Hsiao & Cottrell, 2008). I end by explaining the puzzle of why a network that is rotation invariant shows a face inversion effect.
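
The shift property described above can be checked numerically. The following is a minimal sketch in Python with NumPy, not the implementation used in The Model 2.0: the toy image (two Gaussian blobs), the grid sizes, and all function names are assumptions made for illustration. The image is sampled on a log-polar grid centered on the fixation point, with columns indexing log-radius and rows indexing angle, so that scaling should appear as a left-right shift and rotation as an up-down shift.

```python
import numpy as np

# Illustrative grid parameters (assumptions, not values from the talk).
N_THETA = 64      # number of angular samples (rows)
N_RHO = 48        # number of log-radius samples (columns)
RHO_MIN = -3.0    # smallest log-radius sampled
RHO_STEP = 0.1    # spacing between log-radius samples


def log_polar_sample(image_fn):
    """Sample a continuous image f(x, y) on a log-polar grid.

    Rows index angle, columns index log-radius; the fixation point is (0, 0).
    """
    theta = 2 * np.pi * np.arange(N_THETA) / N_THETA
    r = np.exp(RHO_MIN + RHO_STEP * np.arange(N_RHO))
    x = r[None, :] * np.cos(theta[:, None])
    y = r[None, :] * np.sin(theta[:, None])
    return image_fn(x, y)


def blobs(x, y):
    """Toy image: two off-center Gaussian blobs."""
    return (np.exp(-((x - 1.0) ** 2 + (y - 0.5) ** 2) / 0.1)
            + 0.5 * np.exp(-((x + 0.3) ** 2 + (y - 1.2) ** 2) / 0.2))


def scaled(image_fn, s):
    """Return the image magnified by a factor s about the fixation point."""
    return lambda x, y: image_fn(x / s, y / s)


def rotated(image_fn, phi):
    """Return the image rotated counterclockwise by phi about the fixation point."""
    c, sn = np.cos(phi), np.sin(phi)
    return lambda x, y: image_fn(c * x + sn * y, -sn * x + c * y)


base = log_polar_sample(blobs)

# Magnify by exactly 3 log-radius steps: columns shift right by 3.
m = 3
bigger = log_polar_sample(scaled(blobs, np.exp(m * RHO_STEP)))
scale_is_shift = np.allclose(bigger[:, m:], base[:, :-m])

# Rotate by exactly 5 angular steps: rows shift (circularly) by 5.
k = 5
turned = log_polar_sample(rotated(blobs, 2 * np.pi * k / N_THETA))
rotation_is_shift = np.allclose(turned, np.roll(base, k, axis=0))

print(scale_is_shift, rotation_is_shift)
```

The rotation shift is circular because angle wraps around, whereas the scale shift is not: the columns that enter at the left edge after magnification hold content that was closer to the fixation point than the original grid sampled. The scale factor and rotation angle are chosen here to be whole numbers of grid steps so the shifts are exact; for other values the shift would fall between samples.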

Short Bio: Garrison W. (Gary) Cottrell is a Professor of Computer Science and Engineering and the Director of the Interdisciplinary Ph.D. Program in Cognitive Science at UC San Diego. He was a founding PI of the Perceptual Expertise Network, and directed the Temporal Dynamics of Learning Center, an NSF-sponsored Science of Learning Center comprising 40 PIs at 18 institutions in 4 countries. Professor Cottrell’s research is strongly interdisciplinary. His main interests are Cognitive Science and Computational Cognitive Neuroscience. He focuses on building working models of cognitive processes, and using them to explain psychological, developmental, or neurological processes. In recent years, he has focused on anatomically-inspired deep learning models of the visual system. He has also worked on unsupervised feature learning (modeling precortical and cortical coding), face and object processing, visual salience, and visual attention. His other interest is applying AI to problems in other areas of science or engineering. Most recently, he has been using deep learning to elucidate the structure of small (natural product) molecules from their NMR spectra, in collaboration with Bill Gerwick at the Scripps Institution of Oceanography. He received his PhD in 1985 from the University of Rochester under James F. Allen (thesis title: A connectionist approach to word sense disambiguation). He then did a postdoc with David E. Rumelhart at the Institute of Cognitive Science at UCSD until 1987, when he joined the CSE Department.

Full details can be found on the Mind and Machine website.