14.3 Visual Coding for High-Level Tasks
[Figure 14.3: three-level architecture with Output units 0-9 (top), Map (middle), and Input bitmap (bottom)]
Fig. 14.3. Architecture of the handwritten digit recognition system. In the input, a normalized bitmap image of the digit 8 is presented. The activation propagates through the afferent connections to the map, which is either a SOM or a LISSOM network; a LISSOM map is shown in this figure, together with an outline of the afferent (solid line), lateral inhibitory (dashed black line), and lateral excitatory (dotted white line) connections of one neuron. In LISSOM, the activity settles through the lateral connections into a stable activity pattern (Figure 14.8); in SOM, the response is due only to the afferent connections (Figure 14.7). This pattern is the internal representation of the input, which is then recognized by the array of perceptrons at the output. In this case, the output unit representing 8 is correctly activated, with weak activations on other units representing similar digits, such as 2 and 9. The gray scale from white to black represents activity from low to high at all levels.
will be used to form internal representations for handwritten digits that will then be categorized by a perceptron classifier. The performance of LISSOM representations in this task will be compared with those of the SOM self-organizing map network. As was discussed in Section 3.4, SOM is an abstract, computationally efficient model of cortical maps; however, because it does not have self-organizing lateral connections, the activation patterns on the map do not form a sparse, redundancy-reduced code. LISSOM representations lead to better recognition performance, thereby demonstrating that sparse redundancy-reduced visual coding provides an advantage for information processing.
14.3.2 Method
The recognition system consists of three levels (Figure 14.3): (1) an input sheet of 32 × 32 units, where the input digit is represented as a normalized bitmap; (2) a 20 × 20 unit LISSOM (or SOM) map, which is fully connected to the input sheet; and (3) an output array of 10 perceptron units, corresponding to digits 0 to 9, fully
316 14 Computations in Visual Maps
Fig. 14.4. Handwritten digit examples. One hundred samples from the NIST database 3 are shown demonstrating the variety of inputs in this task. Most of the digits are recognizable to a human observer; however, each digit occurs in different shapes, thicknesses and orientations, there is significant overlap between digits, and the classification is based on small, crucial differences. Such properties make automatic classification of handwritten digits very difficult.
connected to the map level. The map performs the feature analysis and decorrelation of the input, and the perceptrons perform the final recognition.
As training and testing data, the publicly available 2992 pattern subset of the National Institute of Standards and Technology (NIST) special database 3 (Garris 1992; Wilkinson, Garris, and Geist 1993) is used. These data contain samples from a large population of writers, coded into 32 ×32 bitmaps of single digits (Figure 14.4). The digits are first centered and scaled and the image is then normalized according
to

$$\chi_{xy} = \frac{\chi^{o}_{xy}}{\sqrt{\sum_{uv}\left(\chi^{o}_{uv}\right)^{2}}}, \qquad (14.1)$$
where χ^o_xy is the original input unit activity at location (x, y). Such normalization is useful because digit segments can vary in thickness. With normalization, the map activates approximately equally for both thick and thin digits.
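In code, the normalization of Eq. (14.1) amounts to dividing the whole bitmap by its Euclidean norm. The following is a minimal NumPy sketch (the function name and the example strokes are illustrative, not from the original experiments):

```python
import numpy as np

def normalize_input(bitmap):
    """Scale the input activities so the whole 32 x 32 activity vector has
    unit Euclidean length (Eq. 14.1):
    chi_xy = chi^o_xy / sqrt(sum_uv (chi^o_uv)^2)."""
    bitmap = np.asarray(bitmap, dtype=float)
    norm = np.sqrt((bitmap ** 2).sum())
    return bitmap / norm if norm > 0 else bitmap

# A thin and a thick vertical stroke carry the same total "energy" after
# normalization, so the map responds approximately equally to both.
thin = np.zeros((32, 32)); thin[:, 16] = 1.0
thick = np.zeros((32, 32)); thick[:, 15:18] = 1.0
```

Both the thin and the thick stroke end up as unit-length activity vectors, which is exactly why the map cannot distinguish digits by stroke thickness alone.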
The map activation and learning take place as described in Sections 3.4, 4.3, and 4.4, with three extensions that result in better performance in the character recognition domain and allow comparing the results of the two architectures more directly. First, in the SOM network, the Euclidean distance similarity measure is reversed and scaled so that the maximum response is 1 and minimum is 0, as in the LISSOM
network:

$$\eta_{ij} = \frac{d_{\max} - \|X - W_{ij}\|}{d_{\max} - d_{\min}}, \qquad (14.2)$$
|
where ηij is the activity of map unit (i, j), X is the input vector and Wij is the unit’s weight vector, and dmax and dmin are the largest and smallest Euclidean distances between the input vector and the weight vectors on the map.
Second, in the LISSOM network, the RFs of all neurons cover the whole input image, for two reasons: (1) Such connectivity matches that of the SOM model, and
(2) the resulting map representations do not have spatial organization by design; any such structure emerges from the visual coding, making the analysis and comparison easier.
Third, instead of keeping the total sum of the afferent weights constant in LISSOM, they are normalized so that the length of the weight vector remains the same:
$$A'_{xy,ij} = \frac{A_{xy,ij} + \alpha_A\,\chi_{xy}\,\eta_{ij}}{\sqrt{\sum_{uv}\left(A_{uv,ij} + \alpha_A\,\chi_{uv}\,\eta_{ij}\right)^{2}}}, \qquad (14.3)$$
where Axy,ij is the afferent weight between input unit (x, y) and map unit (i, j), χxy is the normalized activity of the input unit (x, y), and αA is the afferent learning rate. Since the input vectors are normalized to constant length, normalizing the weight vectors in the same way allows forming an accurate mapping of the input when the scalar product is used as the similarity measure (Section 14.4).
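Equation (14.3) is a Hebbian increment followed by renormalizing each neuron's weight vector to unit length. A minimal sketch with the weights stored as one matrix, one column per neuron (the function name and the storage layout are assumptions for illustration):

```python
import numpy as np

def update_afferent(A, chi, eta, alpha_A):
    """Hebbian update with length normalization (Eq. 14.3).
    A: afferent weights, shape (n_inputs, n_units), one column per neuron;
    chi: normalized input activities (n_inputs,);
    eta: map activities (n_units,); alpha_A: afferent learning rate."""
    A_new = A + alpha_A * np.outer(chi, eta)       # raw Hebbian term alpha_A * chi_xy * eta_ij
    return A_new / np.linalg.norm(A_new, axis=0)   # each weight vector back to unit length
```

After every update each column again has Euclidean length 1, which is what makes the scalar-product response an accurate similarity measure for unit-length inputs.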
Although the LISSOM map can be organized starting from initially random afferent weights, a more controlled procedure was utilized in the character recognition experiments. A rough initial order was first developed in a SOM map, two copies were made of it, and their training was continued in two different ways: one as a LISSOM map, after normalizing the afferent weights and adding lateral weights, and the other as a SOM map. This procedure is useful because the resulting SOM and LISSOM maps are likely to have similar large-scale organization. The remaining differences reflect primarily the differences in visual coding, which makes comparisons easier.
The perceptrons receive the entire activation pattern on the map as their input. The activation for the perceptron unit ψk is calculated according to
$$\psi_k = \sum_{ij} \eta_{ij}\,P_{ij,k}, \qquad (14.4)$$
where ηij is the activity of map unit (i, j) and Pij,k is the connection weight between map unit (i, j) and perceptron k. The activity ψk represents the likelihood that the input belongs to category k. Thus, the digit represented by the perceptron with the largest activation is taken as the decision of the system. The perceptrons are trained with the delta rule (Haykin 1994; Widrow and Hoff 1960), by changing each weight proportionally to the map activity and the difference between the output and the target:
$$P'_{ij,k} = P_{ij,k} + \alpha_P\left(T_k - \psi_k\right)\eta_{ij}, \qquad (14.5)$$
where αP is the learning rate parameter and Tk is the target value (Tk = 1 if k is the correct digit, and zero otherwise).
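Equations (14.4) and (14.5) together form a one-layer linear classifier trained with the delta rule. A minimal sketch with the map activity flattened into a vector (array shapes and names are illustrative):

```python
import numpy as np

def perceptron_output(eta, P):
    """Eq. (14.4): psi_k = sum_ij eta_ij * P_ij,k.
    eta: flattened map activity (n_map,); P: weights (n_map, 10)."""
    return eta @ P

def delta_rule_step(P, eta, target, alpha_P):
    """Eq. (14.5): each weight changes in proportion to the map activity
    and the difference between the target and the current output."""
    psi = perceptron_output(eta, P)
    return P + alpha_P * np.outer(eta, target - psi)
```

A few repetitions of `delta_rule_step` on a fixed pattern drive the outputs toward the target, so the unit for the correct digit wins; with many overlapping patterns, how cleanly this works depends on the linear separability of the map representations, which is precisely what the experiments measure.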
Instead of perceptrons, other supervised classifiers such as backpropagation, radial basis function networks, or support vector machines (Ben-Hur, Horn, Siegelmann, and Vapnik 2001; Haykin 1994; Moody and Darken 1990; Rumelhart et al.
1986) could be trained to perform the recognition, and they would be likely to perform better. However, the goal of these experiments is not to engineer the best possible digit recognition system, but to demonstrate that some visual codes are easier to recognize than others. Because perceptrons are more sensitive to the separability of the input patterns than the other approaches, they should make such differences more clear in the performance of the system.
Digit recognition performance was measured in a 12-fold cross-validation experiment. In each split of the data into training and testing, the SOM and LISSOM maps were first organized with a training set, and the perceptron network was then trained with the responses of the final maps to the training set patterns: Different perceptron networks were trained with the SOM representations, the initial LISSOM representations, and the settled LISSOM representations. In addition, a fourth perceptron network was trained with the raw bitmap patterns in the training set. The performance of each of these networks was measured both with the training set and with the test set.
The simulation details are specified in Appendix F.2; the representations on the SOM and LISSOM maps and the recognition performance of the different networks are analyzed in the next two subsections.
14.3.3 Forming Map Representations
The final afferent weights for the SOM and LISSOM maps from one example split are shown in Figures 14.5 and 14.6. Even though they were trained from the same intermediate organization with the same sequence of inputs, the SOM and LISSOM maps show different final organizations. The SOM afferent weights are sharply tuned to the input patterns, and clear clusters for each digit 0 to 9 are found on the map. In contrast, the LISSOM afferent weights do not represent all digit categories equally. For example, 2 and 5, which were somewhat less frequent in the dataset, are not represented distinctly anywhere, but appear only as a combination of digits 0 and 3.
Because of these differences in the afferent weights, the responses on the SOM map are continuous and diffuse, whereas LISSOM’s initial responses are sparser, with activity concentrated in several clusters (Figures 14.7b and 14.8b,c). The average kurtosis of these responses over all 2992 patterns was 0.84684, significantly higher than the 0.42988 of the SOM (p < 10−5).
As expected, the strongest lateral connections in the LISSOM map link primarily to areas with similar afferent weights, i.e. those that respond to similar inputs (Figure 14.6). Their effect is to decorrelate and reduce redundant activation, forming a sparse response. The average kurtosis of the settled LISSOM activation was 2.2560, which is significantly higher than that of the initial responses (p < 10−14).
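Kurtosis works as a sparseness measure because a sparse response (a few strongly active units, most near zero) has much heavier tails than a diffuse one. The exact estimator used in the experiments is given in the appendix; the sketch below assumes the common excess-kurtosis definition:

```python
import numpy as np

def kurtosis(activity):
    """Excess kurtosis of an activity pattern, used as a sparseness
    measure: the fourth moment of the standardized activities, minus 3
    (so a Gaussian pattern scores 0, heavier tails score higher)."""
    a = np.asarray(activity, dtype=float).ravel()
    z = (a - a.mean()) / a.std()
    return float((z ** 4).mean() - 3.0)
```

On a 400-unit map, a pattern with only a handful of active units scores far higher than a broad, graded pattern, mirroring the settled-LISSOM vs. SOM comparison in the text.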
At first glance, the LISSOM map and its activity patterns seem less ordered than those of the SOM. However, the differences are mostly due to local vs. distributed style of representation, not regularity. Since the lateral connections in LISSOM link areas that respond to similar inputs, they implement more general neighborhoods than the two-dimensional local areas of the SOM. Representations far apart on the
map can act as neighbors, and the responses are highly regular, even though they are less localized.
These distributed patterns act as attractors of the recurrent network, and they make the LISSOM representations more easily recognizable during performance. The settling process reduces the differences between similar patterns, by reducing redundant activation and focusing the activity at the usual distributed locations for that digit. In contrast, even though the initial responses to different digits may overlap significantly (Figure 14.8b), their differences become amplified during settling (Figure 14.8c). As a result, patterns within a category appear more similar and those across categories more different than they do in either the initial response or the SOM response, making classification of LISSOM representations easier.
In the following subsection, computational support for this informal interpretation will be provided by using perceptrons to measure the separability and generalizability of the map representations.
14.3.4 Recognizing Map Representations
Recognition performance of the perceptron serves as a measure of how good the map representations are. Performance on the training patterns can be used as a measure of separability of the patterns, i.e. how difficult the task is, and the performance on the test patterns can be used as a measure of how regular the representations are, i.e. how easy it is to generalize to new inputs.
Recognition performance based on settled LISSOM, initial LISSOM response, SOM, and raw input were measured and compared over the 12 splits of data into training and test sets. The settled LISSOM patterns turned out to be the easiest to learn (92.9%), followed by the initial LISSOM response (90.2%), SOM (88.7%), and raw input (84.8%). On the test sets, settled LISSOM patterns also performed best at 90.2%, followed by initial LISSOM responses (88.7%), SOM (85.9%), and
far below, the raw input (54.4%). All of these differences are statistically significant (p < 10−4).
The main conclusion from the digit recognition experiments is, therefore, that the sparse, redundancy-reduced internal representations provided by the LISSOM network are both most separable and easiest to generalize. In addition to being efficient, such representations provide a solid foundation for further stages of visual processing, such as pattern recognition.
14.4 Discussion
Comparing the kurtosis and reconstruction of the initial and settled LISSOM response demonstrates that self-organized long-range lateral connections are sufficient to form a sparse, redundancy-reduced visual code. The comparisons with isotropic long-range connection patterns suggest that they are also necessary: Whenever such patterns increased kurtosis, they also lost crucial information about the input. Note that this result applies to long-range connectivity only: Isotropic connections limited
Fig. 14.5. Self-organized SOM afferent weights. The fuzzy digit-like images display the afferent weights for each unit in the 20 × 20 map, in gray scale from white to black (low to high). The SOM has a regular global organization with local clusters sensitive to each digit category. For example, the lower right corner is sensitive to inputs of digit 1, and this preference gradually changes to 7 and then to 4 along the right edge of the map.
to the range of the central Gaussian of the SoG, i.e. approximately the range of a single orientation patch in the OR map, would form a sparse response without significantly degrading the visual code. However, as was mentioned in Section 11.5.1, such a map would not self-organize or perform perceptual grouping properly, and would be an incomplete model of computations in V1.
The handwritten digit recognition experiments in turn demonstrate that the sparse, redundancy-reduced coding not only retains the salient features of the input, but is also particularly effective as input to later stages of the visual system. It is easier to recognize the input based on the LISSOM coding than on a comparable SOM coding which is not sparse and redundancy reduced.
Fig. 14.6. Self-organized LISSOM afferent and lateral weights. Compared with the SOM map in Figure 14.5, the afferent weights are less sharply tuned to individual digits, and the clusters are more irregular and change more abruptly, resulting in more distributed responses (Figure 14.8). The black outline identifies the lateral inhibitory connection weights with above-average strength of the unit marked with the thick black square, which is part of the representation for digit 8. Inhibition goes to areas of similar functionality (i.e. areas sensitive to similar input), thereby decorrelating the map activity and forming a sparse representation of the input.
Such coding could be potentially useful in building artificial vision systems as well (Section 17.3.4). In such practical applications of the LISSOM model, it is important to make sure that the similarity between the input and the weight vectors is measured appropriately. Recall that the unit response in LISSOM is based on the weighted sum, i.e. the scalar product of the input and the weight vector, instead of the Euclidean distance similarity measure as in SOM. The scalar product is biologically more realistic; however, it does not distinguish between differences in angle and length. In principle, both the input and the weight vectors should be normalized

(a) Input  (b) Map response

Fig. 14.7. SOM activity patterns. Three samples of normalized input are shown in (a), and the response of the SOM map to each input in (b). In each case, many units respond with similar activations, resulting in a broad and undifferentiated activity pattern over the map. Response patterns for different digits overlap significantly, making them difficult to classify.

to constant length, so that the similarity is measured only in terms of angles between vectors. In order to preserve the n-dimensional input distribution, a redundant (n + 1)th dimension must then be added to the input and weight vectors before normalization. The original dimensions are interpreted as angles and the (n + 1)th dimension represents the length of the vector, which is chosen to be the same for all inputs. After this transformation, the original input distribution becomes a submanifold of the (n + 1)-dimensional space. Since the dimensions are optimally chosen in the self-organizing process and the (n + 1)th input dimension is redundant, the map self-organizes to represent the original n-dimensional input distribution (Miikkulainen 1991; Sirosh and Miikkulainen 1994a).
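The (n + 1)th-dimension construction can be sketched in a few lines: every input is padded to a common length and then normalized, so that scalar-product similarity depends only on the angle between vectors. Function name and the example values are illustrative:

```python
import numpy as np

def lift_and_normalize(x, length):
    """Append a redundant (n+1)th component that pads every input vector
    to the same Euclidean length `length`, then normalize to unit length.
    The original n-dimensional distribution survives as a submanifold of
    the (n+1)-dimensional space. `length` must exceed the largest input
    norm in the data."""
    x = np.asarray(x, dtype=float)
    extra = np.sqrt(length ** 2 - np.dot(x, x))   # chosen so ||(x, extra)|| == length
    return np.append(x, extra) / length
```

For example, lifting (3, 4) with length 10 gives (0.3, 0.4, sqrt(75)/10), a unit vector whose first two components still encode the original point; two different inputs always map to different directions, so no information is lost in the normalization.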
(a) Input  (b) Initial map response  (c) Settled map response
Fig. 14.8. LISSOM activity patterns. As in Figure 14.7, column (a) shows the normalized input; the LISSOM map activity before and after lateral interaction is shown in columns (b) and (c). The initial responses are sparser than in SOM, although the responses for different digits still overlap significantly. Settling through the lateral connections removes much of the redundant activation and focuses the response around the typical active regions for each digit. After settling, the patterns for the same digit have more overlap, and those for different digits less overlap than before settling, making the digits easier to recognize.
If the input vector lengths do not vary extensively, it may be possible to achieve robust self-organization simply by normalizing the afferent weight vectors to constant length. In some applications, like the digit recognition domain in this chapter, the input may also be normalized without losing crucial information, and the mapping will then be accurate. To a degree, such input normalization takes place in the ON/OFF channels, which respond mostly to edges in the input instead of constant activation (this effect is further enhanced by afferent normalization introduced in Section 8.2.3). In such cases, it is usually sufficient to maintain the total sum of
the weights constant instead of the length. Such normalization constrains the afferent weights of each neuron (i, j) to the hyperplane defined by $\sum_{xy} A_{xy,ij} = 1$, and self-organization produces a mapping of the input space projected onto this hyperplane. In high-dimensional spaces, the distortion due to this projection is small, especially when inputs are limited to a lower-dimensional submanifold, as is the case in the retina. Such a simplification is appealing from the biological standpoint because it provides a simple way to calculate the response, and a unified rule for modifying all synapses.
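The sum-normalization alternative is even simpler than the length normalization of Eq. (14.3): each neuron's weights are rescaled to a constant total rather than a constant Euclidean length. A minimal sketch, assuming non-negative weights stored one column per neuron (names and layout are illustrative):

```python
import numpy as np

def sum_normalize(A):
    """Keep each neuron's total afferent weight constant: rescale the
    weights of neuron (i, j) so that sum_xy A_xy,ij = 1, constraining each
    weight vector to a hyperplane rather than the unit sphere.
    A: non-negative weights, shape (n_inputs, n_units), one column per
    neuron; columns must have nonzero sums."""
    return A / A.sum(axis=0)
```

Dividing by a single scalar per neuron is cheap and uses the same rule for every synapse, which is the biological appeal noted above; the cost is the small projection distortion onto the hyperplane.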
In the handwritten digit recognition experiments, it was interesting to see that the initial LISSOM responses were sparser and easier to recognize than the SOM responses, even though both maps were trained from the same intermediate organization with the same sequence of inputs. These differences result because the afferent weights in LISSOM learn to anticipate the lateral interactions. The final settled patterns are used to modify the afferent weights; as a result, some of the characteristics of the settled patterns become encoded into the afferent weights. Such processes are common in adapting systems in nature: Dynamic processes become automatic, and eventually may even become hardwired in the genome. This result is also promising from the point of view of building practical applications based on LISSOM networks. After the proper recognition behavior has been learned with lateral connections, the behavior can be transferred into the afferent weights with further training, resulting in a simpler system with faster performance.
14.5 Conclusion
The self-organized long-range lateral connections in LISSOM decorrelate the activation on the map, resulting in a sparse code where redundant activation is reduced and the distinguishing features are enhanced. Self-organized, specific lateral connections are necessary for such coding: When equally sparse representations are formed with isotropic lateral connections, information is lost. Such representations are efficient in that they allow more information to be represented in the same area of cortex, but they also provide an information processing advantage. They are more easily separable and generalizable, making further visual processing such as pattern recognition easier.
