Computational Maps in the Visual Cortex (Miikkulainen, 2005)
13 Understanding Perceptual Grouping: Contour Integration
devices while human subjects are freely browsing the environment. Such statistics will account for both environmental and attentional biases, thus accurately representing the input distributions in the different parts of the visual field. This information makes it possible to predict perceptual performance of the different cortical areas.
However, any such comparison would have to take into account the structural differences in the optics and the retina. For example, peripheral inputs tend to be more blurred, and there are far fewer photoreceptors in the periphery than in the fovea. As a result, small details that can easily be seen in the fovea may not be visible in the periphery. However, when inputs are larger, contour integration in the periphery could be similar to that in the fovea (as suggested by W. S. Geisler, personal communication, January 9th, 2004). Such structural factors should be taken into account when gathering input statistics and reasoning about the possible causes of functional divisions. The predictions of such an extended model can then be tested in developmental neurobiological experiments, by manipulating the visual input and measuring the resulting connectivity patterns and contour integration performance, as will be discussed in Section 16.4.8.
13.6 Conclusion
In this chapter, the self-organized afferent and lateral connections of the PGLISSOM model were shown to perform contour integration similarly to human subjects. The model shows how visual input statistics, lateral connection patterns, and perceptual performance are related. It suggests a concrete, testable explanation for how illusory contours arise as a side effect of normal performance and why performance in different parts of the visual field differs. Understanding these processes in the model allows ascribing function to low-level neural circuitry, and provides a foundation for building models of more complex visual tasks.
Part V
EVALUATION AND FUTURE DIRECTIONS
14
Computations in Visual Maps
So far we have seen how a wide variety of psychophysical and neurobiological phenomena can be explained by computations in a laterally connected, self-organized LISSOM network. In Part V, the LISSOM approach will be evaluated as a foundation for further research. In this chapter, the representations in LISSOM maps are analyzed experimentally, and shown to result in a sparse coding that reduces redundancies while preserving the most important features of the input. These representations serve as an efficient foundation for pattern recognition, as will be shown in an example application to handwritten digit recognition. In the next chapter, a method for scaling LISSOM to maps of different sizes, including the size of the entire visual cortex, is presented. The biological assumptions of the model are evaluated and predictions for future biological and psychological experiments are drawn from it in Chapter 16. Extensions of the model, future computational experiments, and new general research directions are proposed in Chapter 17. That chapter also briefly describes Topographica, the publicly available software package for simulating cortical maps, intended to support future research on computational maps in the cortex.
14.1 Visual Coding in the Cortex
How is visual information represented in the cortex? A number of researchers have proposed that the main goal of visual coding, besides representing the important features of the input, is to reduce redundancy (Atick 1992; Atick and Redlich 1990; Barlow 1985; Földiák 1990, 1991b; Rao and Ballard 1997; see Simoncelli and Olshausen 2001 for a review). The idea is related to methods used in compression of bitmap images, and the possible benefits are the same. Redundancy reduction could permit storing and transmitting the retinal image using fewer cells and connections, and as a result the visual cortex could process more visual information with limited resources.
The standard redundancy reduction methods aim at representing all likely inputs in a small number of coding units (e.g. neurons). Each image is coded into the activity
of a small number of units, and the dimensionality of the representation is reduced. However, the cortex takes the opposite approach: A few million optic nerve fibers branch out to more than a hundred million cortical cells (Wandell 1995), and for each small region of the retina there is a large number of neurons with different feature preferences. Therefore, the visual input is expanded out and coded in the activity of a larger number of cells than in the retina. Coding in the cortex must therefore be based on an approach different from straightforward redundancy reduction.
Field (1994) suggested that the cortex might instead aim at representing the input with a minimum number of active units. For any given image, only a small subset of cortical units respond, with most neurons remaining inactive. For different images, different populations of cells are activated. Such sparse coding makes pattern recognition easier: Because each cell responds relatively rarely, it is easier to identify features. If a cell is active, it is possible to predict what inputs caused it to be active. Sparse coding also greatly reduces energy requirements, because spiking is metabolically expensive (Lennie 2003).
However, sparse coding by itself also conflicts with neurobiological evidence. Without redundancy reduction, at least the same number of cells would be active in V1 as in the retinal image, and all of this redundant activity would be carried through to the higher processing levels. The higher levels would then have to be at least as large as V1. However, the higher processing areas in the brain are invariably smaller than the primary visual cortex, and become smaller as one proceeds up the cortical hierarchy.
Taken together, however, sparse coding and redundancy reduction do constitute a strong, consistent hypothesis for the nature of the visual code. More specifically, the receptive field properties of V1 units produce a sparse representation of the input, because any given visual pattern matches only a small percentage of the neurons’ RFs. Redundancy in this sparse response is then reduced by the lateral interactions within V1. As a result, an efficient, sparse coding of the input is formed, suitable for further processing by higher levels of the visual system.
This hypothesis rests crucially on the lateral interactions in V1. As proposed by Barlow (1972, 1989, 1990), lateral connections in V1 could suppress redundant activation by decorrelating the V1 responses. Such a process indeed takes place if neurons that respond to similar inputs are connected with inhibitory lateral connections. In such a case, the response of one neuron can be predicted based on the response of the other. Therefore, the activity of the second neuron is redundant, and a more efficient representation can be formed by eliminating the redundant response. Lateral inhibitory connections have exactly this effect: Whenever these neurons are active together, the inhibition tends to reduce their activation. Such decorrelation filters out the redundancies and concentrates the activity in independent feature-selective units.
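The decorrelation argument can be made concrete with a toy simulation. In the sketch below (the unit model, numbers, and inhibition strength are invented for illustration; this is not the LISSOM activation function), two threshold-linear units receive correlated drive, and mutual inhibition during settling reduces the correlation of their settled responses:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two units driven by an overlapping (shared) feature: their inputs correlate.
n = 5000
shared = rng.normal(size=n)
drive_a = shared + 0.3 * rng.normal(size=n)
drive_b = shared + 0.3 * rng.normal(size=n)

def settle(a, b, w_inh, steps=10):
    """Recurrent settling with mutual inhibition of strength w_inh."""
    ra = np.maximum(a, 0.0)
    rb = np.maximum(b, 0.0)
    for _ in range(steps):
        ra = np.maximum(a - w_inh * rb, 0.0)  # each unit suppressed by the other
        rb = np.maximum(b - w_inh * ra, 0.0)
    return ra, rb

ra0, rb0 = settle(drive_a, drive_b, w_inh=0.0)  # no lateral inhibition
ra1, rb1 = settle(drive_a, drive_b, w_inh=0.8)  # inhibition between co-active units

corr_without = np.corrcoef(ra0, rb0)[0, 1]
corr_with = np.corrcoef(ra1, rb1)[0, 1]
print(corr_without, corr_with)  # inhibition decorrelates the settled responses
```

The key point is that inhibition acts only when both units are active together, which is exactly when their joint activity is redundant.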
The hypothesis is difficult to verify experimentally because it requires measuring activations of large numbers of neurons individually over very short time scales; such spatial and temporal resolution is not available with current imaging or recording techniques. However, computational models such as LISSOM are well suited for testing it. This section demonstrates that (1) LISSOM produces a sparse, decorrelated visual code, and (2) the specific self-organized lateral connections are crucial for this
process. In Section 14.3, such coding will be shown to form an effective foundation for pattern recognition applications as well, using handwritten digit recognition as an example.
14.2 Visual Coding in LISSOM
In LISSOM, the self-organized connections store information about long-term activity correlations through Hebbian learning: the stronger the correlation between two cells, the stronger the connection between them. Because the long-range connections are inhibitory, they reduce the overall activation levels, while short-range lateral excitation locally amplifies the responses of active neurons. As will be shown in this section, this process makes the resulting responses sparse without losing information, i.e. by suppressing redundant activity.
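This correlation-storing property of Hebbian learning can be sketched as follows. The rule below follows the standard form of a normalized Hebbian update; the learning rate, unit count, and input statistics are illustrative, not the model's actual parameters:

```python
import numpy as np

rng = np.random.default_rng(1)

def hebbian_update(w, pre, post, alpha=0.1):
    """Hebbian learning with divisive normalization: connections between
    co-active units strengthen, then the weight vector is renormalized
    so total connection strength stays constant."""
    w = w + alpha * post * pre   # strengthen co-active pairs
    return w / w.sum()           # normalize (keeps weights bounded)

# Three presynaptic units: units 0 and 1 fire together with the
# postsynaptic unit; unit 2 fires independently of it.
w = np.full(3, 1.0 / 3.0)
for _ in range(200):
    co = rng.random() < 0.5      # correlated event drives units 0, 1, and post
    pre = np.array([co, co, rng.random() < 0.5], dtype=float)
    post = float(co)
    if post > 0:
        w = hebbian_update(w, pre, post)

print(w)  # weights to the correlated units 0 and 1 come to dominate
```

After training, the weight vector reflects the long-term co-activation statistics: the stronger the correlation with the postsynaptic unit, the stronger the connection.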
Sparse responses can also be obtained through fixed, isotropic lateral interactions, like those used by nearly all previous computational models of the visual cortex. The connection strength between two neurons in such networks depends only on the distance between the neurons, not on their response properties. Although such interactions also reduce activity, the quality of the visual code is compromised, as will be demonstrated below.
14.2.1 Method
The experiments in this section were based on the reduced LISSOM version of the perceptual grouping network in the previous chapter. Because the analysis focuses on the overall activity patterns instead of grouping, firing-rate units were used instead of spiking neurons, and only the SMAP component of V1 was included in the simulations. (As was described in Section 11.2, SMAP determines the activity patterns in the model, whereas GMAP performs a grouping function among them.) To make input reconstruction practical, the retina was reduced to 36 × 36 receptors and the cortex to 48 × 48 units. Like the perceptual grouping network, the model was trained with long, oriented Gaussians (20,000 presentations of single Gaussians with axis lengths σa = 30 and σb = 1.5; Appendix F.1), and it developed a well-organized orientation map with long-range, patchy lateral connections (Figure 14.1).
Isotropically connected networks were constructed out of the self-organized networks by replacing the lateral inhibitory weights with isotropic weights. That way, all parameters and other components of the architecture were the same for both networks, making fair comparisons possible. A variety of isotropic profiles for the lateral connections were tested, including uniform disks, radial Gaussian distributions, and radial Cauchy distributions. The best performance was found using a sum of two Gaussians (SoG), chosen as a close match to the self-organized weight profiles (Figure 14.1b; Appendix F.1).
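A sum-of-two-Gaussians lateral weight profile of this kind could be constructed as below. The widths and amplitudes here are placeholders; in the actual experiment they were matched to the self-organized weight profiles as described in Appendix F.1:

```python
import numpy as np

def sog_weights(size, sigma_narrow, sigma_broad, a_narrow=1.0, a_broad=0.5):
    """Isotropic sum-of-two-Gaussians lateral weight profile, centered on
    the neuron itself; depends only on distance, not on feature tuning."""
    y, x = np.mgrid[:size, :size]
    c = (size - 1) / 2.0
    d2 = (x - c) ** 2 + (y - c) ** 2
    w = (a_narrow * np.exp(-d2 / (2 * sigma_narrow ** 2))
         + a_broad * np.exp(-d2 / (2 * sigma_broad ** 2)))
    w[int(c), int(c)] = 0.0   # no self-connection
    return w / w.sum()        # normalize total inhibitory strength

# Placeholder widths: the narrow Gaussian stands in for the central peak,
# the broad one for the longest self-organized lateral connections.
w = sog_weights(size=47, sigma_narrow=2.0, sigma_broad=12.0)
```

Because the profile is a function of distance alone, every neuron in such a network inhibits its neighborhood identically, regardless of orientation preference.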
Sparseness of the cortical response was measured as the population kurtosis (i.e. the fourth statistical moment, or peakedness, of the neuronal activity distribution; Field 1994; Willmore and Tolhurst 2001). A small number of strongly responding
(a) Self-organized, 60◦ preference    (b) Sum of Gaussians, isotropic
Fig. 14.1. Self-organized vs. isotropic lateral connections. In (a), self-organized inhibitory lateral connection weights for a sample neuron in the LISSOM orientation map are plotted in gray scale from white to black (low to high); the small white square marks the neuron itself. In (b), the connections of a sample neuron in the network with isotropic long-range connections are shown. This network was constructed by adding two isotropic Gaussians: The smaller Gaussian was chosen as wide as the central peak in the self-organized weights, and the larger to extend as far as the longest self-organized lateral connections. Therefore, all neurons that are connected in the self-organized network are also connected in the sum-of-Gaussians network.
neurons, i.e. a sparse coding, results in high kurtosis, and a broad nonspecific pattern in low kurtosis. For each network, the average amount of kurtosis was measured for a set of 10,000 random input patterns, each with two long contours composed of two or three short, oriented Gaussians (Figure 14.2a).
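Population kurtosis can be computed directly from an activity vector. The sketch below uses the standard excess-kurtosis formula (the fourth standardized moment minus 3); the example activity patterns are synthetic stand-ins for model responses:

```python
import numpy as np

def population_kurtosis(activity):
    """Excess kurtosis of a population activity vector: high values mean
    a few strongly active units and many silent ones (a sparse code)."""
    a = np.asarray(activity, dtype=float).ravel()
    z = (a - a.mean()) / a.std()
    return float((z ** 4).mean() - 3.0)

rng = np.random.default_rng(2)
dense = rng.random(2304)            # 48 x 48 units, broad nonspecific activity
sparse = np.zeros(2304)             # same size, but only 20 units active
sparse[rng.choice(2304, size=20, replace=False)] = rng.random(20) + 0.5

print(population_kurtosis(dense), population_kurtosis(sparse))
```

The broad pattern yields low (here negative) kurtosis, while the pattern with a handful of strongly active units yields very high kurtosis.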
Information content, on the other hand, can be measured in principle by reconstructing the input pattern from the cortical response. The idea is that accurate reconstruction is possible only if information about the input pattern is retained in the coding. A lossy coding, on the other hand, would result in incomplete reconstruction. If the networks were linear, it would be possible to perform the reconstruction simply by inverting the network, i.e. by backprojecting a set of V1 activity patterns through the afferent weights to produce a pattern on the retina. However, the neurons’ activation functions are nonlinear, and their response depends on the nonlinear effect of the lateral connections. It is therefore not practical to reconstruct the input simply by mathematically inverting the network function.
However, an approximate inverse can be obtained by training a nonlinear neural network to compute it. One effective approach is to train a feedforward backpropagation neural network to map each V1 activity pattern to the retinal activity pattern that led to that V1 response. Such networks in general are effective in pattern recognition tasks, and also plausible as a model of how humans learn higher cognitive tasks (Bechtel and Abrahamsen 2002; Elman, Bates, Johnson, Karmiloff-Smith, Parisi, and Plunkett 1996; McClelland and Rogers 2003; Rumelhart et al. 1986; Sejnowski and Rosenberg 1987). One such reconstruction network was trained for the initial V1 response, another for the V1 response settled through self-organized lateral interactions, and a third for the V1 response settled through isotropic SoG lateral interactions. Each reconstruction network was trained and tested in a 10-fold cross-validation experiment with subsets of the same 10,000 retinal and V1 activity
patterns that were used to measure kurtosis. The network and learning parameters were optimized to obtain the best performing network for each case (Appendix F.1). Any differences in reconstruction ability can then be attributed to the quality of the V1 representations.
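The reconstruction setup can be sketched with a minimal batch-backpropagation network. Everything below is a synthetic stand-in: the "retina", the fixed nonlinear "V1" code, the layer sizes, and the learning parameters are invented for brevity, not those of the actual experiment:

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic stand-ins (the real experiment mapped LISSOM V1 responses
# back to 36 x 36 retinal images).
n, n_ret, n_v1, n_hid = 400, 16, 36, 32
retina = rng.random((n, n_ret))
proj = rng.normal(size=(n_ret, n_v1))
v1 = np.maximum(retina @ proj - 0.5, 0.0)   # nonlinear "cortical" response

# One-hidden-layer network trained to invert the code.
W1 = rng.normal(0.0, 0.1, (n_v1, n_hid)); b1 = np.zeros(n_hid)
W2 = rng.normal(0.0, 0.1, (n_hid, n_ret)); b2 = np.zeros(n_ret)

def reconstruct(v):
    return np.tanh(v @ W1 + b1) @ W2 + b2

def rms_error():
    return float(np.sqrt(np.mean((reconstruct(v1) - retina) ** 2)))

rmse_before = rms_error()
lr = 0.01
for _ in range(500):                        # plain batch backpropagation
    h = np.tanh(v1 @ W1 + b1)
    err = (h @ W2 + b2) - retina            # dE/d(output) for squared error
    dW2 = h.T @ err / n; db2 = err.mean(axis=0)
    dh = (err @ W2.T) * (1.0 - h ** 2)      # backpropagate through tanh
    dW1 = v1.T @ dh / n; db1 = dh.mean(axis=0)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

rmse_after = rms_error()
print(rmse_before, rmse_after)  # training lowers the reconstruction error
```

The logic of the measurement carries over directly: if the code retains the input information, such a network can learn a low-error inverse; if the code is lossy, no amount of training recovers the missing parts.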
In this manner, both how sparse the representations are and how well they represent information can be measured. In the next two subsections, representations formed with self-organized lateral connections, isotropic lateral connections, and without any lateral connections will be compared.
14.2.2 Sparse, Redundancy-Reduced Representations
The main effect of the self-organized lateral interactions is to make the cortical representation of the input sparser (compare Figure 14.2b and d). The recurrent excitation and inhibition focus the activity onto the neurons best tuned to the features of the input stimulus, producing a sharper response. The average kurtosis of the V1 activity patterns before settling was 35.4, which was more than doubled, to 85.7, by settling through the self-organized connections. The average total activity in the response reflected this change, decreasing from 16.7 to 9.32 through the settling.
Importantly, the sparse coding is formed without losing information. As Figure 14.2c and e demonstrates, the retinal patterns can be reconstructed from the settled response just as well as from the initial response. To measure the reconstruction ability numerically, the percentage of output patterns that were identifiable, i.e. closest to the correct pattern in the test set, was counted. In both cases, 100% of test patterns (in a 10-fold cross-validation experiment) resulted in identifiable reconstructions.
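The identifiability measure itself is straightforward: a reconstruction counts as identifiable if, among all patterns in the test set, the correct one is its nearest neighbor. A sketch, with synthetic patterns standing in for the actual reconstructions:

```python
import numpy as np

def identifiable_fraction(recon, originals):
    """Fraction of reconstructions whose nearest pattern in the test set
    (Euclidean distance) is the correct one."""
    d = np.linalg.norm(recon[:, None, :] - originals[None, :, :], axis=2)
    return float(np.mean(d.argmin(axis=1) == np.arange(len(originals))))

rng = np.random.default_rng(4)
originals = rng.random((50, 100))                        # stand-in test patterns
good = originals + 0.05 * rng.normal(size=originals.shape)  # faithful outputs
bad = np.zeros_like(originals)                           # uninformative outputs

print(identifiable_fraction(good, originals),
      identifiable_fraction(bad, originals))
```

A faithful reconstruction is always closest to its own source pattern, whereas an uninformative one matches essentially at chance level.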
Thus, the experiments with LISSOM provide computational evidence for the sparse coding with redundancy-reduction hypothesis. By decorrelating the V1 activity, the self-organized lateral connections form a sparse code without losing information.
14.2.3 The Role of Self-Organized Lateral Connections
Are self-organized lateral connections necessary to achieve sparse, redundancy-reduced coding? It turns out that while the SoG network can indeed form a sparse code, it does so by reducing information instead of only redundancy.
In three control experiments, SoG networks were adjusted to perform sparse coding. First, the overall strength of the lateral inhibitory weights was set so that the Gaussian peak in the SoG was as high as the central peak of the self-organized weights (i.e. γI in Equation 4.7 was increased from 4 to 40 while γE remained at 0.9). The goal was to ensure that the SoG network included the lateral connections from the self-organized network, and differed from it by including additional connections as well. As a result, all activity in V1 was eliminated during settling. This result suggests that for the SoG network to form any visual coding at all, the individual long-range isotropic connections must be weaker than the individual self-organized connections. Consequently, any computation that the lateral connections perform,
(a) Retinal activation
(b) Initial V1 response: Kurtosis 37.9
(c) Reconstruction from (b): Average RMS error 0.094
(d) Settled V1 response: Kurtosis 63.0
(e) Reconstruction from (d): Average RMS error 0.094
(f) SoG-settled V1 response: Kurtosis 60.1
(g) Reconstruction from (f): Average RMS error 0.137
Fig. 14.2. Sparse, redundancy-reduced coding with self-organized lateral connections. In (a), an example test input consisting of two multi-segment contours is shown. V1 initially responds to this pattern with multiple large patches of activation (b), but lateral interactions focus the response into the most active neurons (d). This process results in a sparse code, as shown by the increased kurtosis values beneath each plot. The settling reduces redundancy but does not lose information; the input pattern can be reconstructed from both the initial response (c) and the settled response (e) equally well. When the lateral interactions are replaced with isotropic patterns, such as a sum of two Gaussians (f), a sparse code with a similar kurtosis results. However, crucial information about the input is lost in this process. All active neurons inhibit each other, and occasionally a crucial component of the representation is turned off. For example, the activity patch at the center represents the rightmost element of the left contour. It disappears in the settling process of the SoG network, and consequently the reconstruction image is missing this element as well (g). Self-organized patchy lateral connections are therefore crucial in forming a sparse, redundancy-reduced coding of the visual input.
such as decorrelation or grouping, will perforce be weaker in SoG networks than in networks with self-organized lateral connections.
In the second experiment, the strength of the lateral inhibitory connections was reduced (to 11.4) until the average kurtosis of the responses reached 84.4, matching that of the self-organized network. However, the responses were now highly saturated, with average total activation of 27.1 compared with the 9.32 of the self-organized network. As a result, only 41% of the reconstructed input patterns were identifiable.
In the third experiment, therefore, the lateral inhibition and excitation were both adjusted simultaneously (γI to 14 and γE to 0.46) so that both the average kurtosis and the average total activation, at 82.3 and 9.16, were comparable to those of the self-organized network. Figure 14.2f shows a sample cortical response of this SoG network. Overall, it is very similar to that of the self-organized network: The experiment shows that it is possible for isotropic connections to achieve a sparse code similar to that of self-organized connections.
An important difference arises when the reconstruction is attempted based on the SoG representations. Whereas there was no loss of reconstruction ability in the self-organized case, the SoG network performs slightly but consistently worse: The reconstructed patterns are recognizable 99.0% of the time on average (the difference is significant with p < 10^-4). The reason is apparent in Figure 14.2f,g; with certain inputs, the settled SoG pattern is missing representations of parts of the input pattern, and as a result, those parts are also missing from the reconstructed pattern.
The loss of information in the SoG network primarily results from interactions between unrelated input components. As seen in Figure 14.2, even though the two contours have very different orientations, and thus are likely to be independent inputs instead of two parts of the same contour, their representations inhibit each other strongly in the SoG network. As a result, part of the V1 response disappears, allowing only one of the input contours to be reconstructed. In the self-organized case, even though the individual lateral inhibitory connections are stronger, they come from neurons that are often active together, reducing redundant activation only. The representations of unrelated contours do not inhibit each other, and both are retained in the settled response and in the reconstruction.
Similar but even worse results were observed for the other isotropic long-range connection patterns tested, including large Gaussians, Cauchy distributions, and uniform connections. If the isotropic connections were strong enough to provide a sparse code similar to that of the self-organized connections, they reduced the quality of the visual code. These results suggest that patchy, specific, self-organized connections are crucial for a sparse, redundancy-reduced visual code.
In conclusion, because the LISSOM model is computational, it allows testing hypotheses about visual coding in exact, quantitative terms. In doing so, self-organization is found to store long-range activity correlations between feature-selective cells in the lateral connections. During visual processing, this information is used to eliminate redundancies and enhance the selectivity of cortical cells. As a result, the model establishes a sparse, redundancy-reduced coding of the
visual input, which allows representing visual information efficiently with limited resources.
14.3 Visual Coding for High-Level Tasks
The sparse, redundancy-reduced coding is efficient, given that there is a limited number of neurons and activation is expensive. Does it also provide an advantage in information processing? The example high-level application in this section suggests that it indeed does. Recognition of handwritten digits is easier when the visual input is represented on a LISSOM map, as opposed to the standard self-organizing map (SOM) discussed in Section 3.4. The domain is first described below, followed by the recognition system architecture and the results.
14.3.1 The Handwritten Digit Recognition Task
Handwritten digit recognition, a subtask of optical character recognition (OCR), is an important problem with many practical applications. Traditional approaches to this task include algorithmic and statistical methods, such as global transformation and series expansion, geometrical and topological feature extraction, and deriving features from the statistical distribution of points (Govindan and Shivaprasad 1990). More recently, neural networks have been successfully applied to this task as well, including general feedforward networks and dedicated and hybrid methods (Fukushima and Miyake 1982; Keeler and Rumelhart 1992; LeCun et al. 1995; Lee 1996; Lee and Lee 2000a,b; Martin, Rashid, Chapman, and Pittman 1993; Yaeger, Webb, and Lyon 1998). Digit recognition systems achieve 97.6–99.8% accuracy, rivaling estimated human performance at around 99.8% (LeCun et al. 1995).
In general character recognition, context information from the word and sentence may be available; when recognition is done on-line, information such as velocity, continuity of line segments, and the angle of motion is available as well. However, in its most basic and useful form, recognition is done off-line on isolated, normalized digits. Such raw bitmap images can be quite confusing, since many digits share similar features. For example, the digits 7 and 9, 4 and 9, 1 and 7, and 3 and 8 have large overlapping segments, and the distinct features are proportionally smaller than the overlapping ones. Although humans are good at paying attention to the distinct features when classifying digits, it is difficult to do so automatically.
For automatic recognition to be effective, it is necessary to form an internal representation that emphasizes the salient features of the input. Such representations must be highly separable, i.e. different for different digits, and easy to generalize, i.e. similar for the variations of the same digit. These requirements are difficult to achieve at the same time. If the representations are separated too far, generalization will often suffer. Good generalization, on the other hand, usually increases overlap between categories, degrading separation.
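One simple way to quantify this tradeoff, sketched below with synthetic clusters in place of real digit representations (the metric is illustrative, not the one used in the experiments that follow), is to compare mean between-class and within-class distances:

```python
import numpy as np

def separation_and_generalization(reps, labels):
    """Mean between-class distance (higher = more separable) and mean
    within-class distance (lower = easier to generalize over variations)."""
    labels = np.asarray(labels)
    d = np.linalg.norm(reps[:, None, :] - reps[None, :, :], axis=2)
    same = labels[:, None] == labels[None, :]
    np.fill_diagonal(same, False)                  # ignore self-distances
    diff = labels[:, None] != labels[None, :]
    return float(d[diff].mean()), float(d[same].mean())

rng = np.random.default_rng(5)
class0 = rng.normal(0.0, 0.1, (20, 10))            # variations of one "digit"
class1 = rng.normal(1.0, 0.1, (20, 10))            # variations of another
reps = np.vstack([class0, class1])
labels = [0] * 20 + [1] * 20

between, within = separation_and_generalization(reps, labels)
print(between, within)  # good representations keep between large, within small
```

A representation is well suited for recognition when the between-class term stays large while the within-class term stays small; pushing either one in isolation tends to degrade the other, which is the tradeoff described above.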
The hypothesis tested in the experiments that follow is that the sparse, redundancy-reduced representations in LISSOM are separable and generalizable, and thus constitute an effective foundation for pattern recognition. To test this hypothesis, LISSOM
