Computational Maps in the Visual Cortex (Miikkulainen, 2005)

56 3 Computational Foundations

Center neuron (top row) and edge neuron (bottom row): (a) Iteration 0, (b) Iteration 1000, (c) Iteration 5000, (d) Iteration 40,000

Fig. 3.5. Self-organization of weight vectors. The weight vectors of two sample units are plotted on the receptor array at different stages of self-organization. The weight values are represented in gray-scale coding from white to black (low to high). Initially (iteration 0) the weights are uniformly randomly distributed; over several input presentations (such as those shown in Figure 3.4) the weights begin to resemble the input Gaussians in different locations of the receptor surface (iterations 1000, 5000, and 40,000). A neuron at the center of the network (top row) forms a Gaussian weight pattern at the center, while a neuron at the edge (bottom row) forms one near the edge. Such weight patterns together represent the topography of the input space, as seen in Figure 3.6.

χk = exp( −[(x − xc)² + (y − yc)²] / σu² ),   (3.16)

where (x, y) specifies the location of receptor k, (xc, yc) the center of the activity spot, and σu its width. Trained with such inputs, the map should learn to represent the two-dimensional locations on the receptor surface. In other words, if the cortical sheet is interpreted as V1 and the input sheet as the retina, the model should learn a retinotopic mapping.
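Equation 3.16 amounts to a few lines of NumPy. The sketch below assumes receptor coordinates normalized to [0, 1]; the function name and the 24 × 24 default size are illustrative choices, not the book's implementation.

```python
import numpy as np

def gaussian_input(xc, yc, sigma=0.1, size=24):
    """Receptor activities for a Gaussian spot at (xc, yc) (Equation 3.16).
    Receptor coordinates are normalized to [0, 1]."""
    y, x = np.mgrid[0:size, 0:size] / (size - 1)
    return np.exp(-((x - xc)**2 + (y - yc)**2) / sigma**2)

spot = gaussian_input(0.5, 0.5)   # activity peaks at the center of the array
```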

Such a mapping will be formed in this section with the abstract (SOM) version of the self-organizing process, where the map responds and adapts based on a Euclidean-distance similarity measure (Equation 3.15). The SOM is the most common version of self-organizing maps in the literature, and this simulation therefore establishes a baseline for comparison with LISSOM in the next chapter. The map consists of 40 × 40 units fully connected to 24 × 24 receptors; the weights are initially uniformly random. The input Gaussians have a width of σu = 0.1, and their centers are chosen from a uniform random distribution, so that they are evenly scattered over the receptor surface. The rest of the parameter values are described in Appendix E.
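The simulation just described can be sketched as follows. This is a minimal illustrative version, not the actual experiment: the map and receptor arrays are reduced to 10 × 10 and 8 × 8, and the learning rate and shrinking-neighborhood schedule are ad hoc choices rather than the parameter values of Appendix E.

```python
import numpy as np

rng = np.random.default_rng(0)
N, R = 10, 8                          # 10x10 map, 8x8 receptors (reduced sizes)
W = rng.random((N, N, R * R))         # initially uniformly random weights
gy, gx = np.mgrid[0:N, 0:N]           # map coordinates for the neighborhood
ry, rx = (np.mgrid[0:R, 0:R] / (R - 1)).reshape(2, -1)  # receptor coords in [0, 1]

for t in range(2000):
    xc, yc = rng.random(2)                               # random spot center
    v = np.exp(-((rx - xc)**2 + (ry - yc)**2) / 0.1**2)  # Gaussian input (Eq. 3.16)
    d = ((W - v)**2).sum(axis=2)                         # Euclidean-distance response
    wy, wx = np.unravel_index(d.argmin(), d.shape)       # winning unit
    sigma = 3.0 * (1 - t / 2000) + 0.5                   # shrinking neighborhood width
    h = np.exp(-((gx - wx)**2 + (gy - wy)**2) / sigma**2)
    W += 0.1 * h[:, :, None] * (v - W)                   # SOM adaptation (Eq. 3.15)
```

After training, each unit's weight vector approximates a Gaussian at a position matching the unit's place in the map, as in Figure 3.5.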

The weight vectors of each neuron are initially random, i.e. each value is drawn from the uniform distribution within [0, 1] (Figure 3.5). Over several input presentations, they gradually turn into representations of the input Gaussians at different locations. For example, the neuron at the center of the network forms a Gaussian weight pattern at the center of the receptor surface, and a neuron at the edge of the network forms one near the edge.

Such weight vectors form a topographic mapping of the input space. This mapping can be illustrated by first calculating the center of gravity of each neuron’s weight vector (as a weighted sum of the receptor coordinates divided by the total weight; Appendix G.2). The centers can then be plotted as points on the receptor surface, as is done in Figure 3.6. The square area in each subfigure represents the receptor surface, and the centers of neighboring neurons are connected by a line to illustrate the topology of the network. Since the afferent weights are initially random, their centers of gravity are initially clustered around the middle. As inputs are presented and weights adapt, this cluster gradually unfolds, and spreads out into a smooth grid that covers the receptor surface. In the final self-organized map the topographic order of the centers matches the topographic order of neurons in the network. The network has learned to represent stimulus location accurately, and maps the input space uniformly. The plot is slightly contracted because the Gaussian weight patterns near the edges are truncated: Although the peak of a Gaussian is close to the edge, its center of gravity is always well inside the edge (Figure 3.5d, bottom row).
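The center-of-gravity calculation described above (a weighted average of receptor coordinates; see Appendix G.2 for the book's formulation) can be sketched for a hypothetical 3 × 3 weight pattern:

```python
import numpy as np

def center_of_gravity(w, size):
    """Center of gravity of a weight vector on a size x size receptor grid:
    the weight-weighted average of the receptor coordinates."""
    w = np.asarray(w, dtype=float).reshape(size, size)
    y, x = np.mgrid[0:size, 0:size]
    return (w * x).sum() / w.sum(), (w * y).sum() / w.sum()

# A weight pattern peaked at the upper-left corner receptor of a 3x3 grid:
w = [4.0, 1.0, 0.0,
     1.0, 1.0, 0.0,
     0.0, 0.0, 0.0]
cx, cy = center_of_gravity(w, 3)   # cx = cy = 2/7, well inside the grid
```

Note that even though the peak sits at the corner receptor, the center of gravity lies well inside the grid, which is exactly the contraction effect at the map edges.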

More generally, the distribution of the neuron weight vectors in the final map approximates the distribution of the input vectors (Ritter 1991). A dense area of the input space, i.e. an area with many input vectors, will be allocated more units in the map (Figure 3.7; simulation parameters in Appendix E). This means that such areas are magnified in the map representation, which is useful for data analysis, and also corresponds to the structure of biological maps.

In these example simulations, the only significant feature of the input is its location, and therefore the map learns to represent retinotopy. Retinotopic mappings consist of two dimensions (x and y), and because the map is also two-dimensional, such a mapping is straightforward. However, if the input patterns are elongated and oriented, or originate from two different eyes, the map will also represent those input features. Such a case is more complicated because the input has more dimensions than are available in the map. Self-organizing map models of orientation, ocular dominance, and direction selectivity all have this property, as will be discussed in Chapter 5. The next section will show how the map represents inputs with more than two dimensions of variation in a two-dimensional structure. Such an analysis allows us to understand the organization of biological maps better.

3.5 Knowledge Representation in Maps

When there are more than two dimensions of variation in the input, similar inputs may not always be represented by nearby locations on the two-dimensional map. How does the map representation approximate high-dimensional spaces? First, most often the distribution in the high-dimensional space is not uniform; the map will form a principal curve through the lower dimensional clusters of data embedded in the high-dimensional space. Second, when clusters are indeed multidimensional, the map will form hierarchical folds in the higher dimensions. The space is covered roughly uniformly, but not all similarity relations are preserved, resulting in a patchy map organization. These principles are illustrated in the two subsections below.

(a) Iteration 0: Initial   (b) Iteration 1000: Unfolding   (c) Iteration 5000: Expanding   (d) Iteration 40,000: Final

Fig. 3.6. Self-organization of a retinotopic map. For each neuron in the network, the center of gravity of its weight vector is plotted as a point on the receptor surface. Each point is connected to the centers of the four neighboring neurons by a line (note that these connections only illustrate neighborhood relations between neurons, not actual physical connections through which activity is propagated). Initially the weights are random, and the centers are clustered in the middle of the receptor surface. As self-organization progresses, the points spread out from the center and organize into a smooth topographic map of the input space.

(a) Gaussian distribution   (b) Two long Gaussians

Fig. 3.7. Magnification of dense input areas. Whereas in Figure 3.6 the inputs were uniformly distributed over the receptor surface, map (a) was trained with inputs appearing more frequently in the middle, and map (b) with two such high-density areas diagonally from the middle. More units are allocated to representing the dense areas, which means that they are represented more accurately on the map. Similar magnification is observed in biological maps.

3.5.1 Principal Surfaces

Often with high-dimensional data, the dimensions do not all vary independently, and only certain combinations of values are possible. For example, in a map representing (x, y) location, orientation, and direction selectivity, all orientations and directions must be represented at all locations, but only directions that are roughly perpendicular to the orientation can ever be detected. Therefore, even though there are four dimensions of variation, the data are inherently three-dimensional. The first principle of dimensionality reduction in self-organizing maps is to find such low-dimensional structures in the data and map those structures instead of the entire high-dimensional space.

The standard linear method for such dimensionality reduction is principal component analysis (PCA; Jolliffe 1986; Oja 1989; Ritter, Martinetz, and Schulten 1992). PCA is a coordinate transformation where the first dimension (i.e. the first principal component) is aligned with the direction of maximum variance in the data; the second principal component is aligned with the direction of maximum variance among all directions orthogonal to the first one, and so on (Figure 3.8). If the data need to be reduced to one dimension, the first principal component can be used to describe them; if two dimensions are allowed, the first two, and so on. In this way, as much of the variance as possible is represented in as few dimensions as possible.
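PCA itself can be sketched with a singular value decomposition of the centered data. The toy two-dimensional data set below, with most of its variance along the diagonal, is an illustrative stand-in, not data from the book.

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy 2D data whose variance is concentrated along the (1, 1) direction:
t = rng.normal(size=500)
X = np.column_stack([t, t + 0.1 * rng.normal(size=500)])

Xc = X - X.mean(axis=0)                        # center the data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1, pc2 = Vt                                  # components, by decreasing variance
# pc1 points (up to sign) along (1, 1)/sqrt(2), and S[0] >> S[1].
```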

The main problem with PCA is that if the data distribution is nonlinear, a low-dimensional hyperplane cannot provide an accurate description (Hastie and Stuetzle 1989; Kambhatla and Leen 1997; Ritter et al. 1992). This fact is illustrated in Figure 3.8b: The first principal component misses the main features of the data, and provides only an inaccurate approximation.

(a) Linear distribution   (b) Nonlinear distribution

Fig. 3.8. Principal components of data distributions. In principal component analysis, the data originally represented in (x, y) coordinates are transformed into the principal component coordinate system: The first principal component (PC1) aligns with the direction of maximum variance in the data, and the second (PC2) is orthogonal to it. The lengths of the axes reflect the variance along each coordinate dimension. (a) The two-dimensional distribution has a linear structure, and the first component alone is a good representation. However, with a nonlinear distribution (b), PCA does not result in a good lower dimensional representation, even though the distribution lies on a one-dimensional curve.

Nonlinear distributions are best approximated by curved structures, i.e. hypersurfaces rather than hyperplanes. For such distributions, one can define principal curves and principal surfaces, in a fashion analogous to principal components (Hastie and Stuetzle 1989; Ritter et al. 1992). Intuitively, the principal curve passes through the middle of the data distribution, as shown in Figure 3.9a: The center of gravity of the area enclosed by two infinitesimally close normals lies on the principal curve.

Let us consider the task of finding a principal curve of a data distribution. Let X be a data point and f a smooth curve in the input space, and let df(X) be the distance from the data point to the closest point on the curve. The squared distance Df of the data distribution P(X) to the curve f can then be defined as

Df = ∫ df²(X) P(X) dX.   (3.17)

The curve f is the principal curve of the data distribution P(X) if Df is minimal. Principal surfaces can be defined in the same way, by replacing the curve f with a multidimensional surface.

(a) Principal curve   (b) Folded curve

Fig. 3.9. Approximating nonlinear distributions with principal curves and folding. (a) The principal curve passes through the middle of the data distribution, providing a more detailed representation of nonlinear distributions than principal components. Each point on the curve is positioned at the center of gravity of the part of the distribution enclosed within two infinitesimally close normals (Ritter et al. 1992). (b) If a more detailed representation of the thickness of the distribution is desired, the curve can be folded in the higher dimension.

It turns out that a self-organizing map is a way of computing a discretized approximation of the principal surface (Ritter et al. 1992). Assume that the principal surface is discretized into a set of vectors Wi. For a data point X, let Wimg(X) be the closest vector. Equation 3.17 can then be written as

Df = ∫ ‖X − Wimg(X)‖² P(X) dX.   (3.18)

The problem of finding the principal surface now reduces to finding a set of reference vectors Wi that minimizes the squared reconstruction error. It can be shown that the learning rule

Wimg(X) ← Wimg(X) + α [X − Wimg(X)]   (3.19)
minimizes this error under certain conditions (Ritter et al. 1992). This rule is identical to the abstract self-organizing map adaptation rule (Equation 3.15) when there is no neighborhood cooperation (i.e. h equals the delta function). In this case, the map places each weight vector at the center of gravity of the data points for which it is a winner. When the neighborhood function is introduced, other data points contribute also, and the center of gravity is calculated based on this larger volume.
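This special case is easy to make concrete: with h a delta function, Equation 3.19 is the online version of k-means clustering, and each reference vector drifts to the center of gravity of the data points for which it wins. A minimal sketch with two illustrative, well-separated clusters (the learning rate and cluster parameters are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
# Two well-separated 2D clusters, centered at (0, 0) and (1, 1):
X = np.vstack([rng.normal(0.0, 0.1, (200, 2)),
               rng.normal(1.0, 0.1, (200, 2))])
W = X[rng.choice(len(X), size=2, replace=False)].copy()  # two reference vectors

for x in rng.permutation(X):
    w = ((W - x)**2).sum(axis=1).argmin()   # img(X): index of the closest vector
    W[w] += 0.05 * (x - W[w])               # the update of Equation 3.19
# Each reference vector ends up near the center of gravity of one cluster.
```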

The above analysis explains why the self-organizing map can develop efficient approximations of nonlinear high-dimensional input distributions. If there is low-dimensional nonlinear structure in the input, the map can follow the nonlinearities of the input distribution, and represent local as well as global structure. Perceptual categories as well as higher-level concepts are generally believed to organize into such low-dimensional manifolds (Kohonen, Kaski, Lagus, Salojärvi, Honkela, Paatero, and Saarela 2000; Li, Farkas, and MacWhinney 2004; Ritter et al. 1992; Roweis and Saul 2000; Seung and Lee 2000; Tenenbaum, de Silva, and Langford 2000; Tiňo and Nabney 2002). The self-organizing maps in the cortex therefore provide an efficient mechanism for representing structured sensory information in two dimensions.


3.5.2 Folding

Even when there is low-dimensional structure in the input, a two-dimensional principal surface, i.e. the self-organizing map, may not be sufficient to represent it. For example, it may not be accurate enough to reduce the entire distribution in Figure 3.9a to the principal curve; it might also be necessary to represent how far the points are from the curve. The only way the one-dimensional curve can represent the entire area is to make tight turns across the width of the area, while gradually progressing along its length (Figure 3.9b). In other words, the map has to fold in the higher dimensions in order to represent them as well as possible.

Similar folding of the map occurs when a high-dimensional distribution of points is mapped using a two-dimensional network. The results are interesting from a biological standpoint, because they help explain why patchy patterns of feature preferences, such as those for ocular dominance and orientation, form in the primary visual cortex (Kohonen 1989; Obermayer et al. 1992; Ritter et al. 1991). Let us use ocular dominance as an example. At each (x, y) location of the visual field, there are different ocular dominance values that must be represented. Let us assume that the variance in ocular dominance is smaller than the variance in the length and width dimensions. The goal is therefore to determine a dimension-reducing mapping of inputs in a flat box, where the two longer dimensions represent retinotopy, and the height dimension represents ocular dominance, onto a two-dimensional network (Figure 3.10; simulation parameters in Appendix E).

As the network self-organizes, it first stretches along the two longest dimensions of the box, and then folds in the smaller third dimension (Figure 3.10a). The folding takes place because the network tries to approximate the third dimension with the two-dimensional surface, analogous to Figure 3.9b. The weight values in the third dimension can be visualized for every neuron by coloring it with a corresponding gray-scale value, as in Figure 3.10b. The resulting pattern is very similar to the pattern of ocular dominance stripes seen in the primary visual cortex.
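This flat-box simulation can be sketched as follows, with illustrative sizes and schedules rather than the parameters of Appendix E: a 16 × 16 map, a box height of 0.15, and an ad hoc shrinking neighborhood.

```python
import numpy as np

rng = np.random.default_rng(3)
N, H = 16, 0.15                            # 16x16 map; box height H (third dimension)
box = np.array([1.0, 1.0, H])
W = rng.random((N, N, 3)) * box            # random weight vectors inside the box
gy, gx = np.mgrid[0:N, 0:N]                # map coordinates for the neighborhood

for t in range(4000):
    v = rng.random(3) * box                            # input point in the flat box
    d = ((W - v)**2).sum(axis=2)
    wy, wx = np.unravel_index(d.argmin(), d.shape)     # winning unit
    sigma = 4.0 * (1 - t / 4000) + 0.5                 # shrinking neighborhood width
    h = np.exp(-((gx - wx)**2 + (gy - wy)**2) / sigma**2)
    W += 0.1 * h[:, :, None] * (v - W)

ocularity = W[:, :, 2]   # third-dimension weights: the gray scale of Fig. 3.10(b)
```

The map stretches along the two long dimensions and folds in the height dimension; plotting `ocularity` in gray scale gives a patchy pattern reminiscent of Figure 3.10b, though at this reduced size the stripes are crude.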

It is also important to understand how the mapping changes with increasing variation in the third dimension and with an increasing number of dimensions. Let us first examine how the afferent patterns organize when the network is trained with input distributions of different heights. If the height is zero, all the inputs lie in a plane, and a smooth self-organized two-dimensional map will develop as in Figure 3.6. As the height is increased, fluctuations in the third dimension gradually appear, but the pattern is not stable, and keeps changing as training progresses. Beyond a threshold height zf, however, a spontaneous phase transition occurs, and a stable folding pattern develops.

When the networks are trained with inputs with more than three dimensions of variation, the same principle can be observed in a recursive fashion. In a two-dimensional network, the map stretches along the two dimensions of maximum variance first, and folds along the dimensions of the next highest variance. A recursive folding structure then develops: The primary folds represent the dimensions of third highest variance, subfolds within the primary folds represent the dimensions with the next highest variance, and so on. Thus, the map develops a representation of the statistically most relevant features of the input space. All the dimensions whose variance is greater than zf are represented in the map in this recursive fashion. In this way, it is possible to capture several feature dimensions, such as retinotopy, ocular dominance, and orientation, in a single two-dimensional map. The computational model therefore offers a clear explanation for the observed overlapping map structure of the visual cortex: It is a self-organizing map representing high-dimensional input in two dimensions.

(a) Representing the third dimension by folding   (b) Visualization of ocular dominance

Fig. 3.10. Three-dimensional model of ocular dominance. The model consists of a two-dimensional map of a three-dimensional space. The first two dimensions can be interpreted as retinotopy and the third dimension as ocular dominance (Ritter et al. 1991, 1992). In (a), the input space is indicated by the box outline, and the weight vectors of the map units are plotted in this space as a grid (as in Figure 3.6). The map extends along the longer retinotopic dimensions x and y, and folds in the smaller height dimension to approximate the space. (b) The weight value for the height dimension is visualized for each neuron: Gray-scale values from black to white represent continuously changing values from low to high. The resulting pattern resembles the ocular dominance stripes found in the visual cortex, suggesting that they too could be the result of a self-organized mapping of a three-dimensional parameter space.

3.6 Conclusion

In understanding how the maps in the visual cortex develop and function, the cortical column has emerged as the appropriate computational unit. In most cases, the weighted-sum firing-rate model is sufficient to capture its behavior; when temporal coding is important (as in segmentation and binding), a leaky integrator model of the spiking neuron can be used, synchronizing neural activity to represent coherent bindings. These computational units adapt based on Hebbian learning, normalizing the total weight by redistributing synaptic resources.

When lateral interactions are established between such units, the self-organizing map model of the cortex is obtained. This model is a simple yet powerful learning architecture for representing the statistical structure of data distributions. If the dimensionality of the mapping surface (i.e. the network) is less than the dimensionality of the data distribution, the surface first extends nonlinearly along the most dominant dimensions of the data distribution, and then folds in the other dimensions. The folding process gives rise to local structures, such as alternating stripes resembling ocular dominance and orientation patches in the primary visual cortex. In the following chapters, such structures are shown to be not just superficially similar, but in fact good approximations of those seen in the primary visual cortex.

To make computational and mathematical analysis tractable, the self-organizing map models typically abstract away the lateral interactions in the cortex. These interactions are reintroduced in LISSOM, showing that they play a powerful and so far largely unrecognized role in cortical processing. Lateral connections between neurons can learn synergetically with the afferent connections and represent higher order statistical information in the network. This generalization results in a more accurate model of the visual cortex, accounting for development, plasticity, and many functional phenomena.

Part II

INPUT-DRIVEN SELF-ORGANIZATION