Click on the title to get more information about a publication.
pre-print
deep-learning, vision
MindSet: Vision. A toolbox for testing DNNs on key psychological experiments.
Valerio Biscione, Dong Yin, Gaurav Malhotra, Marin Dujmovic, Milton L Montero, Guillermo Puebla, Federico Adolfi, Rachel F Heaton, John E Hummel, Benjamin D Evans, Karim Habashy, Jeffrey S Bowers
arXiv
tl;dr: We create a benchmark for testing Deep Learning models of vision based on classic findings from behavioural studies on human vision.
Multiple benchmarks have been developed to assess the alignment between deep neural networks (DNNs) and human vision. In almost all cases, these benchmarks are observational in the sense that they are composed of behavioural and brain responses to naturalistic images that have not been manipulated to test hypotheses regarding how DNNs or humans perceive and identify objects. Here we introduce the toolbox MindSet: Vision, consisting of a collection of image datasets and related scripts designed to test DNNs on 30 psychological findings. In all experimental conditions, the stimuli are systematically manipulated to test specific hypotheses regarding human visual perception and object recognition. In addition to providing pre-generated datasets of images, we provide code to regenerate these datasets, offering many configurable parameters that greatly extend the datasets' versatility for different research contexts, and code to facilitate the testing of DNNs on these image datasets using three different methods (similarity judgments, out-of-distribution classification, and a decoder method). We test ResNet-152 on each of these methods as an example of how the toolbox can be used.
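The snippet below is a rough sketch (not the MindSet code itself, which ships with the datasets) of the similarity-judgment style of readout mentioned above: it compares pretrained ResNet-152 activations for a base stimulus and a systematically manipulated version of it. The image paths are placeholders.

```python
import torch
from torchvision import models, transforms
from torchvision.models.feature_extraction import create_feature_extractor
from PIL import Image

# Pretrained ResNet-152, with activations read out after global average pooling
model = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V2).eval()
extractor = create_feature_extractor(model, return_nodes={"avgpool": "feats"})

preprocess = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def activation(path):
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return extractor(img)["feats"].flatten()

# "base.png" and "manipulated.png" are placeholder paths to a stimulus and a
# systematically manipulated version of the same stimulus
sim = torch.nn.functional.cosine_similarity(
    activation("base.png"), activation("manipulated.png"), dim=0)
print(f"cosine similarity: {sim.item():.3f}")
```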
tl;dr: Mechanistic models frequently explain human behavioural differences in terms of parameter values in an internal generative model. We argue that researchers should go beyond mapping behaviour to parameter values and take a participant's perspective in an experiment, asking why and how parameters take on certain values.
Cognitive scientists and neuroscientists are increasingly deploying computational models to develop testable theories of psychological functions and make quantitative predictions about cognition, brain activity and behaviour. Computational models are used to explain target phenomena such as experimental effects and individual and/or population differences. They do so by relating these phenomena to the underlying components of the model that map onto distinct cognitive mechanisms. These components make up a "cognitive state space", where different positions correspond to different cognitive states that produce variation in behaviour. We examine the rationale and practice of such model-based inferences and argue that model-based explanations typically miss a key ingredient: they fail to explain why and how agents occupy specific positions in this space. A critical insight is that the agent's position in the state space is not fixed, but that the behaviour they produce is the result of a trajectory. Therefore, we discuss (i) the constraints that limit movement in the state space; (ii) the reasons for moving around at all (i.e., agents' objectives); and (iii) the information and cognitive mechanisms that guide these movements. We review existing research practices, from experimental design to the model-based analysis of data, and discuss how these practices can (and should) be improved to capture the agent's dynamic trajectory in the state space. In so doing, we stand to gain better and more complete explanations of the variation in cognition and behaviour over time, between different environmental conditions and between different populations or individuals.
Inferring DNN-Brain alignment using Representational Similarity Analyses can be problematic.
Marin Dujmović, Jeffrey S Bowers, Federico Adolfi, Gaurav Malhotra
ICLR 2024 Workshop on Representational Alignment (Re-Align)
tl;dr: RSA is frequently used to measure similarity between deep neural networks (DNNs) and brains. However, we show that two visual systems that identify objects based on different visual features can still give you high RSA scores, calling into question the use of this metric.
Representational Similarity Analysis (RSA) has been used to compare representations across individuals, species, and computational models. Here we focus on comparisons made between the activity of hidden units in Deep Neural Networks (DNNs) trained to classify objects and neural activations in visual cortex. In this context, DNNs that obtain high RSA scores are often described as good models of biological vision, a conclusion at odds with the failure of DNNs to account for the results of most vision experiments reported in psychology. How can these two sets of findings be reconciled? Here, we demonstrate that high RSA scores can easily be obtained between two systems that classify objects in qualitatively different ways when second-order confounds are present in image datasets. We argue that these confounds likely exist in the datasets used in current and past research. If RSA is going to be used as a tool to study DNN-human alignment, it will be necessary to experimentally manipulate images in ways that remove these confounds. We hope our simulations motivate researchers to reexamine the conclusions they draw from past research and focus more on RSA studies that manipulate images in theoretically motivated ways.
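For readers unfamiliar with the metric, the following is a minimal sketch of how an RSA score is typically computed (this is the generic recipe, not the simulations from the paper); the random arrays stand in for DNN activations and neural recordings to the same 50 stimuli.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
dnn_acts = rng.normal(size=(50, 2048))    # 50 stimuli x DNN features
brain_acts = rng.normal(size=(50, 100))   # 50 stimuli x recorded units

def rdm(acts):
    # representational dissimilarity matrix: 1 - Pearson correlation between stimulus patterns
    return squareform(pdist(acts, metric="correlation"))

upper = np.triu_indices(50, k=1)
rsa_score, _ = spearmanr(rdm(dnn_acts)[upper], rdm(brain_acts)[upper])
print(f"RSA (Spearman) = {rsa_score:.3f}")
```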
On the importance of severely testing deep learning models of cognition.
Jeffrey S Bowers, Gaurav Malhotra, Federico Adolfi, Marin Dujmović, Milton L Montero, Valerio Biscione, Guillermo Puebla, John E Hummel, Rachel F Heaton
Cognitive Systems Research
tl;dr: We argue that many unwarranted conclusions regarding deep neural network (DNN) and human similarities are drawn because of a lack of severe testing of hypotheses.
Researchers studying the correspondences between Deep Neural Networks (DNNs) and humans often give little consideration to severe testing when drawing conclusions from empirical findings, and this is impeding progress in building better models of minds. We first detail what we mean by severe testing and highlight how this is especially important when working with opaque models with many free parameters that may solve a given task in multiple different ways. Second, we provide multiple examples of researchers making strong claims regarding DNN-human similarities without engaging in severe testing of their hypotheses. Third, we consider why severe testing is undervalued. We provide evidence that part of the fault lies with the review process. There is now a widespread appreciation in many areas of science that a bias for publishing positive results (among other practices) is leading to a credibility crisis, but there seems to be less awareness of the problem here.
Human shape representations are not an emergent property of learning to classify objects.
Gaurav Malhotra, Marin Dujmović, John Hummel, Jeffrey S Bowers
Journal of Experimental Psychology: General
tl;dr: We show that humans are sensitive to relations between object features — a property that does not automatically emerge when convolutional neural networks learn to classify objects.
Humans are particularly sensitive to relationships between parts of objects. It remains unclear why this is. One hypothesis is that relational features are highly diagnostic of object categories and emerge as a result of learning to classify objects. We tested this by analyzing the internal representations of supervised convolutional neural networks (CNNs) trained to classify large sets of objects. We found that CNNs do not show the same sensitivity to relational changes as previously observed for human participants. Furthermore, when we precisely controlled the deformations to objects, human behavior was best predicted by the number of relational changes while CNNs were equally sensitive to all changes. Even changing the statistics of the learning environment by making relations uniquely diagnostic did not make networks more sensitive to relations in general. Our results show that learning to classify objects is not sufficient for the emergence of human shape representations. Instead, these results suggest that humans are selectively sensitive to relational changes because they build representations of distal objects from their retinal images and interpret relational changes as changes to these distal objects. This inferential process makes human shape representations qualitatively different from those of artificial neural networks optimized to perform image classification.
Reinforcement learning under uncertainty: expected versus unexpected uncertainty and state versus reward uncertainty.
Adnane Ez-Zizi, Simon Farrell, David Leslie, Gaurav Malhotra, Casimir JH Ludwig
Computational Brain & Behavior
tl;dr: In real-world environments, rewards are stochastic and non-stationary, and perceptual systems are noisy. This makes it difficult to learn about rewards. We explore how people update their beliefs about rewards in such complex environments.
Two prominent types of uncertainty that have been studied extensively are expected and unexpected uncertainty. Studies suggest that humans are capable of learning from reward under both expected and unexpected uncertainty when the source of variability is the reward. How do people learn when the source of uncertainty is the environment’s state and the rewards themselves are deterministic? How does their learning compare with the case of reward uncertainty? The present study addressed these questions using behavioural experimentation and computational modelling. Experiment 1 showed that human subjects were generally able to use reward feedback to successfully learn the task rules under state uncertainty, and were able to detect a non-signalled reversal of stimulus-response contingencies. Experiment 2, which combined all four types of uncertainties—expected versus unexpected uncertainty, and state versus reward uncertainty—highlighted key similarities and differences in learning between state and reward uncertainties. We found that subjects performed significantly better in the state uncertainty condition, primarily because they explored less and improved their state disambiguation. We also show that a simple reinforcement learning mechanism that ignores state uncertainty and updates the state-action value of only the identified state accounted for the behavioural data better than both a Bayesian reinforcement learning model that keeps track of belief states and a model that acts based on sampling from past experiences. Our findings suggest a common mechanism supports reward-based learning under state and reward uncertainty.
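As a toy illustration of the winning model described above (the perceptual-noise process and all parameters are illustrative, not the paper's fitted values), the following sketch implements a learner that ignores state uncertainty: it commits to whichever state it identifies on a trial and applies an ordinary delta-rule update to that state-action value only.

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions, alpha, epsilon = 2, 2, 0.2, 0.1
Q = np.zeros((n_states, n_actions))
reward_rule = {0: 0, 1: 1}   # deterministic reward: each state has one correct action

for trial in range(500):
    true_state = rng.integers(n_states)
    # perceptual noise: the identified state is sometimes wrong
    identified = true_state if rng.random() > 0.2 else 1 - true_state
    # epsilon-greedy choice based on the identified state only
    action = rng.integers(n_actions) if rng.random() < epsilon \
        else int(np.argmax(Q[identified]))
    reward = float(action == reward_rule[true_state])
    # update only the identified state's value; no belief distribution over states is kept
    Q[identified, action] += alpha * (reward - Q[identified, action])

print(Q)
```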
2023
deep-learning, vision
Deep Problems with Neural Network Models of Human Vision.
Jeffrey S Bowers, Gaurav Malhotra, Marin Dujmović, Milton Llera Montero, Christian Tsvetkov, Valerio Biscione, Guillermo Puebla, Federico Adolfi, John E Hummel, Rachel F Heaton, Benjamin D Evans, Jeffrey Mitchell, Ryan Blything
Behavioral and Brain Sciences
tl;dr: We identify gaps between Deep Neural Networks and human vision and argue for controlled experiments for correctly comparing the two.
Deep neural networks (DNNs) have had extraordinary successes in classifying photographic images of objects and are often described as the best models of biological vision. This conclusion is largely based on three sets of findings: (1) DNNs are more accurate than any other model in classifying images taken from various datasets, (2) DNNs do the best job in predicting the pattern of human errors in classifying objects taken from various behavioral datasets, and (3) DNNs do the best job in predicting brain signals in response to images taken from various brain datasets (e.g., single cell responses or fMRI data). However, these behavioral and brain datasets do not test hypotheses regarding what features are contributing to good predictions and we show that the predictions may be mediated by DNNs that share little overlap with biological vision. More problematically, we show that DNNs account for almost no results from psychological research. This contradicts the common claim that DNNs are good, let alone the best, models of human object recognition. We argue that theorists interested in developing biologically plausible models of human vision need to direct their attention to explaining psychological findings. More generally, theorists need to build models that explain the results of experiments that manipulate independent variables designed to test hypotheses rather than compete on making the best predictions. We conclude by briefly summarizing various promising modeling approaches that focus on psychological data.
The role of capacity constraints in Convolutional Neural Networks for learning random versus natural data.
Christian Tsvetkov, Gaurav Malhotra, Benjamin D Evans, Jeffrey S Bowers
Neural Networks
tl;dr: We show that CNNs exhibit a super-human capacity to learn visual inputs, which can be partially remedied by introducing internal noise in activations.
Convolutional neural networks (CNNs) are often described as promising models of human vision, yet they show many differences from human abilities. We focus on a superhuman capacity of top-performing CNNs, namely, their ability to learn very large datasets of random patterns. We verify that human learning on such tasks is extremely limited, even with few stimuli. We argue that the performance difference is due to CNNs' overcapacity and introduce biologically inspired mechanisms to constrain it, while retaining the good test-set generalisation to structured images that is characteristic of CNNs. We investigate the efficacy of adding noise to hidden units' activations, restricting early convolutional layers with a bottleneck, and using a bounded activation function. Internal noise was the most potent intervention and the only one which, by itself, could reduce random data performance in the tested models to chance levels. We also investigated whether networks with biologically inspired capacity constraints show improved generalisation to out-of-distribution stimuli; however, little benefit was observed. Our results suggest that constraining networks with biologically motivated mechanisms paves the way for closer correspondence between network and human performance, but the few manipulations we have tested are only a small step towards that goal.
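A minimal sketch of the internal-noise intervention is given below, assuming additive Gaussian noise on hidden-unit activations at both training and test time; the toy network and noise level are illustrative rather than the architectures examined in the paper.

```python
import torch
import torch.nn as nn

class NoisyReLU(nn.Module):
    """ReLU followed by additive Gaussian noise on the activations."""
    def __init__(self, sigma=1.0):
        super().__init__()
        self.sigma = sigma

    def forward(self, x):
        x = torch.relu(x)
        return x + self.sigma * torch.randn_like(x)   # internal noise, applied at train and test time

net = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=5, padding=2), NoisyReLU(sigma=1.0),
    nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=5, padding=2), NoisyReLU(sigma=1.0),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 10),
)
print(net(torch.randn(4, 3, 32, 32)).shape)   # torch.Size([4, 10])
```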
Advances in Neural Information Processing Systems (NeurIPS)
tl;dr: In Montero et al. (2021) we showed that making generative models disentangled doesn't necessarily lead to better combinatorial generalisation (see below). Here, we show that this problem is not simply due to limitation of the decoder. Even latent representations of highly disentangled VAEs fail to show combinatorial generalisation.
Recent research has shown that generative models with highly disentangled representations fail to generalise to unseen combinations of generative factor values. These findings contradict earlier research which showed improved performance in out-of-training distribution settings when compared to entangled representations. Additionally, it is not clear if the reported failures are due to (a) encoders failing to map novel combinations to the proper regions of the latent space, or (b) novel combinations being mapped correctly but the decoder being unable to render the correct output for the unseen combinations. We investigate these alternatives by testing several models on a range of datasets and training settings. We find that (i) when models fail, their encoders also fail to map unseen combinations to correct regions of the latent space and (ii) when models succeed, it is either because the test conditions do not exclude enough examples, or because excluded cases involve combinations of object properties with the object's shape. We argue that to generalise properly, models not only need to capture factors of variation, but also understand how to invert the process that causes the visual stimulus.
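In the same spirit, the encoder-side diagnostic can be sketched as follows, assuming a trained encoder that returns latent means as a NumPy array and ground-truth generative factors for each image; the linear map from factors to latents is a deliberate simplification for illustration.

```python
import numpy as np

def latent_prediction_error(encoder, train_images, train_factors,
                            test_images, test_factors):
    """How far do held-out factor combinations land from where a linear
    factors-to-latents map (fitted on training combinations) predicts?"""
    z_train = encoder(train_images)    # (n_train, latent_dim) latent means
    z_test = encoder(test_images)      # (n_test, latent_dim)
    X_train = np.hstack([train_factors, np.ones((len(train_factors), 1))])
    X_test = np.hstack([test_factors, np.ones((len(test_factors), 1))])
    coef, *_ = np.linalg.lstsq(X_train, z_train, rcond=None)
    z_pred = X_test @ coef             # where the unseen combinations "should" go
    return np.mean(np.linalg.norm(z_test - z_pred, axis=1))
```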
Feature blindness: a challenge for understanding and modelling visual object recognition.
Gaurav Malhotra, Marin Dujmović, Jeffrey S Bowers
PLOS Computational Biology
tl;dr: We find that humans ignore highly predictive non-shape features in novel objects, a behaviour that contrasts with Deep Neural Networks and demonstrates the inflexibility of human shape-bias.
Humans rely heavily on the shape of objects to recognise them. Recently, it has been argued that Convolutional Neural Networks (CNNs) can also show a shape-bias, provided their learning environment contains this bias. This has led to the proposal that CNNs provide good mechanistic models of shape-bias and, more generally, human visual processing. However, it is also possible that humans and CNNs show a shape-bias for very different reasons, namely, shape-bias in humans may be a consequence of architectural and cognitive constraints whereas CNNs show a shape-bias as a consequence of learning the statistics of the environment. We investigated this question by exploring shape-bias in humans and CNNs when they learn in a novel environment. We observed that, in this new environment, humans (i) focused on shape and overlooked many non-shape features, even when non-shape features were more diagnostic, (ii) learned based on only one out of multiple predictive features, and (iii) failed to learn when global features, such as shape, were absent. This behaviour contrasted with the predictions of a statistical inference model with no priors, showing the strong role that shape-bias plays in human feature selection. It also contrasted with CNNs that (i) preferred to categorise objects based on non-shape features, and (ii) increased reliance on these non-shape features as they became more predictive. This was the case even when the CNN was pre-trained to have a shape-bias and the convolutional backbone was frozen. These results suggest that shape-bias has a different source in humans and CNNs: while learning in CNNs is driven by the statistical properties of the environment, humans are highly constrained by their previous biases, which suggests that cognitive constraints play a key role in how humans learn to recognise novel objects.
Benjamin D Evans, Gaurav Malhotra, Jeffrey S Bowers
Neural Networks
tl;dr: We find that adding a layer of Gabor and centre-surround filters to CNNs helps them generalise to out-of-distribution stimuli
Deep Convolutional Neural Networks (DNNs) have achieved superhuman accuracy on standard image classification benchmarks. Their success has reignited significant interest in their use as models of the primate visual system, bolstered by claims of their architectural and representational similarities. However, closer scrutiny of these models suggests that they rely on various forms of shortcut learning to achieve their impressive performance, such as using texture rather than shape information. Such superficial solutions to image recognition have been shown to make DNNs brittle in the face of more challenging tests such as noise-perturbed or out-of-distribution images, casting doubt on their similarity to their biological counterparts. In the present work, we demonstrate that adding fixed biological filter banks, in particular banks of Gabor filters, helps to constrain the networks to avoid reliance on shortcuts, making them develop more structured internal representations and more tolerance to noise. Importantly, they also gained around 20–35% improved accuracy when generalising to our novel out-of-distribution test image sets over standard end-to-end trained architectures. We take these findings to suggest that these properties of the primate visual system should be incorporated into DNNs to make them more able to cope with real-world vision and better capture some of the more impressive aspects of human visual perception such as generalisation.
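As an illustration of the fixed biological front-end idea (not the paper's implementation), the sketch below installs a frozen bank of Gabor filters as the first convolutional layer of a network; the filter parameters are arbitrary.

```python
import math
import torch
import torch.nn as nn

def gabor_kernel(size=15, theta=0.0, lam=6.0, sigma=3.0, gamma=0.5, psi=0.0):
    """A single 2D Gabor filter at orientation theta."""
    half = size // 2
    ys, xs = torch.meshgrid(torch.arange(-half, half + 1, dtype=torch.float32),
                            torch.arange(-half, half + 1, dtype=torch.float32),
                            indexing="ij")
    x_t = xs * math.cos(theta) + ys * math.sin(theta)
    y_t = -xs * math.sin(theta) + ys * math.cos(theta)
    return torch.exp(-(x_t**2 + (gamma * y_t)**2) / (2 * sigma**2)) \
        * torch.cos(2 * math.pi * x_t / lam + psi)

thetas = [i * math.pi / 8 for i in range(8)]                  # 8 orientations
bank = torch.stack([gabor_kernel(theta=t) for t in thetas])   # (8, 15, 15)

gabor_layer = nn.Conv2d(3, 8, kernel_size=15, padding=7, bias=False)
with torch.no_grad():
    # apply the same filter to every input channel
    gabor_layer.weight.copy_(bank.unsqueeze(1).repeat(1, 3, 1, 1))
gabor_layer.weight.requires_grad_(False)                      # fixed, not learned end-to-end

out = gabor_layer(torch.randn(1, 3, 224, 224))
print(out.shape)                                              # torch.Size([1, 8, 224, 224])
```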
International Conference on Learning Representations (ICLR)
tl;dr: We show that disentangled latent representations do not necessarily lead to better combinatorial generalisation in Variational Auto-Encoders.
Combinatorial generalisation — the ability to understand and produce novel combinations of familiar elements — is a core capacity of human intelligence that current AI systems struggle with. Recently, it has been suggested that learning disentangled representations may help address this problem. It is claimed that such representations should be able to capture the compositional structure of the world which can then be combined to support combinatorial generalisation. In this study, we systematically tested how the degree of disentanglement affects various forms of generalisation, including two forms of combinatorial generalisation that varied in difficulty. We trained three classes of variational autoencoders (VAEs) on two datasets on an unsupervised task by excluding combinations of generative factors during training. At test time we ask the models to reconstruct the missing combinations in order to measure generalisation performance. Irrespective of the degree of disentanglement, we found that the models supported only weak combinatorial generalisation. We obtained the same outcome when we directly input perfectly disentangled representations as the latents, and when we tested a model on a more complex task that explicitly required independent generative factors to be controlled. While learning disentangled representations does improve interpretability and sample efficiency in some downstream tasks, our results suggest that they are not sufficient for supporting more difficult forms of generalisation.
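The train/test split used in this kind of study can be sketched as follows, assuming dSprites-style integer factor annotations; the factor columns, values and threshold used here are hypothetical.

```python
import numpy as np

def combinatorial_split(factors, shape_col=1, x_col=4, shape_val=2, x_thresh=16):
    """Hold out every image whose shape factor equals `shape_val` AND whose
    x-position factor is in the right half; everything else is for training."""
    held_out = (factors[:, shape_col] == shape_val) & (factors[:, x_col] >= x_thresh)
    return np.where(~held_out)[0], np.where(held_out)[0]   # train indices, test indices

# factors: (n_images, n_factors) array of integer generative-factor values
factors = np.random.default_rng(0).integers(0, 32, size=(1000, 6))
train_idx, test_idx = combinatorial_split(factors)
print(len(train_idx), len(test_idx))
```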
What do adversarial images tell us about human vision?
Marin Dujmović, Gaurav Malhotra, Jeffrey S Bowers
eLife
tl;dr: Adversarial images can fool AI systems into misclassifying images. We show that human response to adversarial images is qualitatively different from CNNs, which classify these images with high confidence.
Deep convolutional neural networks (DCNNs) are frequently described as the best current models of human and primate vision. An obvious challenge to this claim is the existence of adversarial images that fool DCNNs but are uninterpretable to humans. However, recent research has suggested that there may be similarities in how humans and DCNNs interpret these seemingly nonsense images. We reanalysed data from a high-profile paper and conducted five experiments controlling for different ways in which these images can be generated and selected. We show human-DCNN agreement is much weaker and more variable than previously reported, and that the weak agreement is contingent on the choice of adversarial images and the design of the experiment. Indeed, we find there are well-known methods of generating images for which humans show no agreement with DCNNs. We conclude that adversarial images still pose a challenge to theorists using DCNNs as models of human vision.
Hiding a plane with a pixel: shape-bias in CNNs and the benefit of building in biological constraints.
Gaurav Malhotra, Benjamin D Evans, Jeffrey S Bowers
Vision Research
tl;dr: We show that CNNs can learn highly idiosyncratic features in images. This behaviour can be ameliorated by attaching an input layer of V1-like filters.
When deep convolutional neural networks (CNNs) are trained “end-to-end” on raw data, some of the feature detectors they develop in their early layers resemble the representations found in early visual cortex. This result has been used to draw parallels between deep learning systems and human visual perception. In this study, we show that when CNNs are trained end-to-end they learn to classify images based on whatever feature is predictive of a category within the dataset. This can lead to bizarre results where CNNs learn idiosyncratic features such as high-frequency noise-like masks. In the extreme case, our results demonstrate image categorisation on the basis of a single pixel. Such features are extremely unlikely to play any role in human object recognition, where experiments have repeatedly shown a strong preference for shape. Through a series of empirical studies with standard high-performance CNNs, we show that these networks do not develop a shape-bias merely through regularisation methods or more ecologically plausible training regimes. These results raise doubts over the assumption that simply learning end-to-end in standard CNNs leads to the emergence of similar representations to the human visual system. In the second part of the paper, we show that CNNs are less reliant on these idiosyncratic features when we forgo end-to-end learning and introduce hard-wired Gabor filters designed to mimic early visual processing in V1.
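A toy construction of the kind of confound described above is sketched below: a dataset in which a single pixel is perfectly diagnostic of the category. Image size and pixel positions are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_image(label, size=64):
    img = rng.random((size, size)).astype(np.float32)   # noisy background
    img[2 + label, 2] = 1.0   # the diagnostic pixel: its row depends only on the label
    return img

labels = rng.integers(0, 10, size=256)
images = np.stack([make_image(int(l)) for l in labels])
print(images.shape)   # (256, 64, 64): a CNN trained end-to-end can exploit that one pixel
```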
Mechanistic models must link the field and the lab
Alasdair I Houston, Gaurav Malhotra
Behavioral and Brain Sciences (commentary)
tl;dr: We critique a theory of animal foraging behaviour and argue that realistic theories must build in various sources of environmental uncertainties.
In the theory outlined in the target article, an animal forages continuously, making sequential decisions in a world where the amount of food and its uncertainty are fixed, but delays are variable. These assumptions contrast with the risk-sensitive foraging theory and create a problem for comparing the predictions of this model with many laboratory experiments that do not make these assumptions.
Optimal gut size of small birds and its dependence on environmental and physiological parameters.
Adnane Ez-Zizi, John M McNamara, Gaurav Malhotra, Alasdair I Houston
Journal of Theoretical Biology
tl;dr: We show that birds have an optimal gut-size which is determined by a trade-off between energetic gains and cost of digestion and foraging.
Most optimal foraging models assume that the foraging behaviour of small birds depends on a single state variable, their energy reserves in the form of stored fat. Here, we include a second state variable—the contents of the bird's gut—to investigate how a bird should optimise its gut size to minimise its long-term mortality, depending on the availability of food, the size of meal and the bird's digestive constraints. Our results show that (1) the current level of fat is never less important than gut contents in determining the bird's survival; (2) there exists a unique optimal gut size, which is determined by a trade-off between the energetic gains and costs of maintaining a large digestive system; (3) the optimal gut size increases as the bird's digestive cycle becomes slower, allowing the bird to store undigested food; (4) the critical environmental factor for determining the optimal gut size is the mass of food found in a successful foraging effort (“meal size”). We find that when the environment is harsh, it is optimal for the bird to maintain a gut that is larger than the size of a meal. However, the optimal size of the gut in rich environments exactly matches the meal size (i.e. the mass of food that the optimal gut can carry is exactly the mass of food that can be obtained in a successful foraging attempt).
Time varying decision boundaries: Insights from optimality analysis.
Gaurav Malhotra, David S Leslie, Casimir JH Ludwig, Rafal Bogacz
Psychonomic Bulletin & Review
tl;dr: We use dynamic programming to show that, in many real-world situations, optimal decision thresholds can collapse with time.
The most widely used account of decision-making proposes that people choose between alternatives by accumulating evidence in favor of each alternative until this evidence reaches a decision boundary. It is frequently assumed that this decision boundary stays constant during a decision, depending on the evidence collected but not on time. Recent experimental and theoretical work has challenged this assumption, showing that constant decision boundaries are, in some circumstances, sub-optimal. We introduce a theoretical model that facilitates identification of the optimal decision boundaries under a wide range of conditions. Time-varying optimal decision boundaries for our model are a result only of uncertainty over the difficulty of each trial and do not require decision deadlines or costs associated with collecting evidence, as assumed by previous authors. Furthermore, the shape of optimal decision boundaries depends on the difficulties of different decisions. When some trials are very difficult, optimal boundaries decrease with time, but for tasks that only include a mixture of easy and medium difficulty trials, the optimal boundaries increase or stay constant. We also show how this simple model can be extended to more complex decision-making tasks such as when people have unequal priors or when they can choose to opt out of decisions. The theoretical model presented here provides an important framework to understand how, why, and whether decision boundaries should change over time in experiments on decision-making.
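The backward-induction idea can be sketched as follows under simplified assumptions (binary evidence samples, two difficulty levels, a small cost per sample, a fixed horizon); this illustrates the approach rather than reproducing the paper's model or parameter values.

```python
import numpy as np

p_levels = np.array([0.5, 0.7])   # mixture of difficulties (0.5 = impossible trials)
T, cost = 40, 0.005               # horizon (max samples) and cost per extra sample

def posterior_up(t, k):
    """P(correct answer is 'up' | k up-samples out of t), marginalising difficulty."""
    like_up = np.sum(p_levels**k * (1 - p_levels)**(t - k))
    like_down = np.sum(p_levels**(t - k) * (1 - p_levels)**k)
    return like_up / (like_up + like_down)

def p_next_up(t, k):
    """Predictive probability that the next sample favours 'up'."""
    w_up = p_levels**k * (1 - p_levels)**(t - k)
    w_down = p_levels**(t - k) * (1 - p_levels)**k
    num = np.sum(w_up * p_levels) + np.sum(w_down * (1 - p_levels))
    return num / (np.sum(w_up) + np.sum(w_down))

V = np.zeros((T + 1, T + 1))           # value of being at (time t, k up-samples)
stop = np.zeros((T + 1, T + 1), bool)  # is deciding at least as good as continuing?
boundary = np.full(T + 1, np.nan)      # smallest |evidence| at which stopping is optimal
for t in range(T, -1, -1):
    for k in range(t + 1):
        acc = max(posterior_up(t, k), 1 - posterior_up(t, k))   # value of deciding now
        if t == T:
            V[t, k], stop[t, k] = acc, True
            continue
        q = p_next_up(t, k)
        cont = -cost + q * V[t + 1, k + 1] + (1 - q) * V[t + 1, k]
        V[t, k], stop[t, k] = max(acc, cont), acc >= cont
    reached = [abs(2 * k - t) for k in range(t + 1) if stop[t, k]]
    boundary[t] = min(reached) if reached else np.nan
# With very hard (p = 0.5) trials in the mixture, the optimal boundary
# (in units of evidence difference) tends to shrink as time goes on.
print(boundary)
```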
Overcoming indecision by changing the decision criterion.
Gaurav Malhotra, David S Leslie, Casimir JH Ludwig, Rafal Bogacz
Journal of Experimental Psychology: General
tl;dr: We show that people frequently decrease their decision thresholds during perceptual decision-making but deviate from optimal decision boundaries.
The dominant theoretical framework for decision making asserts that people make decisions by integrating noisy evidence to a threshold. It has recently been shown that in many ecologically realistic situations, decreasing the decision boundary maximizes the reward available from decisions. However, empirical support for decreasing boundaries in humans is scant. To investigate this problem, we used an ideal observer model to identify the conditions under which participants should change their decision boundaries with time to maximize reward rate. We conducted 6 expanded-judgment experiments that precisely matched the assumptions of this theoretical model. In this paradigm, participants could sample noisy, binary evidence presented sequentially. Blocks of trials were fixed in duration, and each trial was an independent reward opportunity. Participants therefore had to trade off speed (getting as many rewards as possible) against accuracy (sampling more evidence). Having access to the actual evidence samples experienced by participants enabled us to infer the slope of the decision boundary. We found that participants indeed modulated the slope of the decision boundary in the direction predicted by the ideal observer model, although we also observed systematic deviations from optimality. Participants using suboptimal boundaries do so in a robust manner, so that any error in their boundary setting is relatively inexpensive. The use of a normative model provides insight into what variable(s) human decision makers are trying to optimize. Furthermore, this normative model allowed us to choose diagnostic experiments and in doing so we present clear evidence for time-varying boundaries.
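A toy simulation of the expanded-judgment paradigm with a linearly collapsing boundary is given below; the boundary intercept, slope, and evidence quality are illustrative values, not estimates from participants' data.

```python
import numpy as np

rng = np.random.default_rng(2)

def run_trial(p_correct=0.65, b0=4.0, slope=-0.05, max_steps=200):
    """Accumulate binary evidence until it crosses a boundary that decreases linearly in time."""
    evidence = 0
    for t in range(1, max_steps + 1):
        evidence += 1 if rng.random() < p_correct else -1
        bound = max(b0 + slope * t, 0.0)            # collapsing boundary
        if abs(evidence) >= bound:
            return t, evidence > 0                  # decision time, correct?
    return max_steps, evidence > 0

times, correct = zip(*(run_trial() for _ in range(2000)))
print(f"mean RT = {np.mean(times):.1f} samples, accuracy = {np.mean(correct):.2f}")
```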
Please note that the PDF articles provided here are for your own personal, scholarly use. Please do not distribute or post these files.
Copyright Notice: The documents distributed here have been provided as a means to ensure timely dissemination of scholarly and technical work on a noncommercial basis. Copyright and all rights therein are maintained by the authors or by other copyright holders, notwithstanding that they have offered their works here electronically. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author’s copyright. These works may not be reposted without the explicit permission of the copyright holder.