15 Scaling LISSOM simulations
Current computational models such as LISSOM can account for much of the structure of the visual cortex and how it develops, as well as many of its functional properties. However, other important phenomena, such as orientation interactions between spatially separated stimuli and long-range visual contour and object integration, have remained out of reach because they require too much computation time and memory to simulate. In this chapter, two interrelated techniques are presented for making detailed large-scale simulations practical. First, a set of linear scaling equations is derived that allows computing the appropriate parameter settings for a large-scale simulation given an equivalent small-scale simulation. Second, a method called GLISSOM is developed where the map is systematically grown based on the scaling equations, allowing the entire visual cortex to be simulated at the column level with desktop workstations. The scaling equations can also be used to quantify differences in biological systems, and to determine values for the model parameters to match measurements in particular biological species.
15.1 Parameter Scaling Approach
A given LISSOM simulation focuses on a particular area of the visual field with a particular density of retinal receptors and V1 neurons. Modeling new phenomena often requires setting up a different area or density in the model. For example, a larger portion of the visual space, i.e. a larger part of V1 and the eye, may have to be simulated; the area may have to be simulated at a finer resolution; or a species, individual, or brain area that devotes more neurons or receptors to representing the same visual space needs to be modeled.
Varying the area or density over a wide range can be difficult in a complex nonlinear system like LISSOM. Parameter settings that work well for one size are usually not appropriate for other sizes, and it is not always clear which parameters need to be adjusted. Fortunately, with LISSOM it is possible to derive a set of equations that allows computing the appropriate parameter values for each type of transformation directly. The equations treat the cortical network as a finite approximation
of a continuous map, i.e. one that is composed of an infinite number of units (see Amari 1980; Fellenz and Taylor 2002; Roque Da Silva Filho 1992; Wu, Amari, and Nakahara 2002 for theoretical analyses of continuous maps). Under such an assumption, networks of different sizes represent coarser or denser approximations of the continuous map, and any given approximation can be transformed into another by conceptually reconstructing the continuous map and then resampling it. Given an existing retina and cortex, the scaling equations provide the parameter values needed to self-organize a functionally equivalent smaller or larger retina and cortex.
To be most useful, such scaling should result in equivalent maps at different sizes. Map organization must therefore not depend on the initial weights, because those will vary between different-size networks. As was shown in Section 8.4, the LISSOM algorithm has exactly this property: The organization is determined by the stream of input patterns, not by the initial weight values. In effect, the LISSOM scaling equations provide a set of parameters for a new simulation that, when run, will develop similar results to the existing simulation.
The equations for scaling area and density will be derived in the next section. The more minor LISSOM parameters can be scaled as well, using the methods described in Appendix A.2. Most of the simulations presented in this book were set up using these equations, and they were crucial for the large maps used in Section 10.2. In Section 15.3, these equations will be utilized systematically by developing an incremental scaling algorithm for self-organizing very large maps.
15.2 Scaling Equations
In this section, the method for scaling the visual area, retinal receptor density, and cortical density in LISSOM models is developed. The scaling equations are derived theoretically and verified experimentally, with particular attention to the limitations of the scaling approach. Although the method applies to all versions of LISSOM, the simulations are based on the reduced LISSOM orientation map without ON/OFF channels (Chapters 6, 7, and 14; Appendix B). This model is complex enough to demonstrate the power of scaling and simple enough to observe its effects clearly. The equations are derived for the central area of the retina with full representation in the cortex (Figure A.1), but they can also be extended to include the border area (as shown in Appendix A.2).
15.2.1 Scaling the Area
The simplest case of scaling consists of changing the area of the visual space simulated. The model can be developed quickly with a small area, then enlarged to eliminate border effects and to simulate the full area of a biological experiment. To change the area, both the V1 width N and the retina width R must be scaled by the same proportion m relative to their initial values No and Ro. (The ON and OFF channels of the LGN change just as the retina does; Appendix A.2.)
In such scaling, it is also necessary to ensure that the resulting network has the same amount of learning per neuron per iteration. Otherwise, self-organizing a larger network would take more training iterations; as a result, the final map would be different, because nonlinear thresholding is performed at each iteration (Equation 4.6). To achieve the same amount of learning, the average activity per input receptor needs to remain constant. With discrete input patterns such as Gaussians, the number of patterns np per iteration must therefore be scaled with the retinal area. With natural image inputs, it is sufficient to make sure that the images cover the full, larger retina; the parameter np can be ignored. Assuming discrete inputs, the equations for scaling the area by a factor m are
N = mNo,  R = mRo,  np = m²npo.  (15.1)
An example of such scaling with m = 4 is shown in Figure 15.1. The simulation details are described in Appendix B.3.
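As a concrete sketch, the area-scaling rules can be written in Python (a hypothetical helper, not part of the LISSOM codebase; the example values follow Figure 15.1, with an assumed one input pattern per iteration in the original network):

```python
def scale_area(N_o, R_o, np_o, m):
    """Scale cortex width N, retina width R, and patterns per iteration np
    by an area factor m (Equation 15.1). The widths scale linearly with m;
    the number of discrete input patterns scales with the area, m**2, so
    that the average activity per input receptor stays constant."""
    N = int(round(m * N_o))
    R = int(round(m * R_o))
    np_new = int(round(m**2 * np_o))
    return N, R, np_new

# Example corresponding to Figure 15.1 (m = 4, assuming np_o = 1):
# scale_area(54, 24, 1, 4) -> (216, 96, 16)
```

Because all three rules are linear in the widths, applying the helper twice with factors m1 and m2 is equivalent to applying it once with m1 * m2.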
15.2.2 Scaling Retinal Density
Retinal density, i.e. the number of retinal units per degree of visual field, may have to be adjusted, e.g. to model species with larger eyes or parts of the eye that have more receptors per unit area. In practice, higher density means increasing retina size R while keeping the corresponding visual area the same. This type of change also allows the cortical magnification factor N:R, i.e. the ratio between the V1 and retina densities, to be matched with values measured in a particular species. The scaling equations allow R to be increased to any desired value, although in most cases (especially when modeling newborns) a low density suffices.
The parameter adjustments to change density are slightly more complicated than those for area. First, in order to avoid disrupting how the maps develop, the visual area processed by each neuron must be kept constant. More specifically, the ratio of the afferent connection radius and the retina width must be constant, i.e. rA must scale with R.
Second, to make sure equivalent maps develop, the average total weight change per neuron per iteration must remain the same in the original and the scaled network. When the connection radius increases, the total number of afferent connections per neuron increases dramatically. Because the learning rate αA specifies the amount of change per connection and not per neuron (Equation 4.8), the learning rate must be adjusted to compensate; otherwise, a given input pattern would modify the weights of the scaled network more than those of the original network. The afferent learning rate αA needs to be scaled inversely with the number of afferent connections to each neuron, which in the continuous plane corresponds to the area enclosed by the afferent radius. That is, αA scales by the ratio rAo²/rA².
Third, because the average activity per iteration also affects self-organization, the size of the input features must also scale with R. For Gaussian inputs, the ratio between the width σ and R must be kept constant; other input types can be scaled similarly. Thus, the retinal density scaling equations are
(a) Original retina: R = 24. (b) Retinal area scaled by 4.0: R = 96.
(c) Original V1: N = 54, 0.4 hours, 8 MB. (d) V1 area scaled by 4.0: N = 216, 9 hours, 148 MB.
Fig. 15.1. Scaling retinal and cortical area. The small retina (a) and V1 (c) were scaled to a size 16 times larger (b,d) using Equation 15.1. To make it easier to compare map structure, especially in early iterations, the OR maps are plotted without selectivity in this chapter. The lateral inhibitory connections of one central neuron, marked with a small white square, are indicated in white outline. The simulation time and the number of connections scale approximately linearly with the area, and thus the larger network takes about 16 times more time and memory to simulate. For discrete input patterns like these oriented Gaussians, it is necessary to have more patterns to keep the total learning per neuron and per iteration constant. Because the inputs are generated randomly across the retina, each map sees a different stream of inputs, and so the patterns of orientation patches on the final maps differ. The area scaling equations are most useful for developing a model with a small area and then scaling it up to eliminate border effects and to simulate the full area of a corresponding biological preparation.
(a) Original retina: R = 24, σa = 7, σb = 1.5. (b) Retina scaled by 2.0: R = 48, σa = 14, σb = 3. (c) Retina scaled by 3.0: R = 72, σa = 21, σb = 4.5.
Fig. 15.2. Scaling retinal density. Each column shows a LISSOM orientation map from one of three matched 96 × 96 networks with retinas of different densities. The parameters for each network were calculated using Equation 15.2, and each network was then trained independently on the same random stream of input patterns. The size of the input pattern in retinal units grows as the retinal density is increased, but its size as a proportion of the retina remains constant. All of the resulting maps are similar as long as R is large enough to represent the input faithfully, with almost no change above R = 48. Thus, a low value can be used for R in practice. Such scaling of retinal density is useful for modeling species and areas with higher receptor resolution, and for matching the cortical magnification factor of a model to that of a particular species.
rA = (R/Ro) rAo,  αA = (rAo²/rA²) αAo,  σa = (R/Ro) σao,  σb = (R/Ro) σbo.  (15.2)
Figure 15.2 demonstrates that these equations can be used to generate functionally equivalent orientation maps with different retinal receptor densities. The crucial parameters are those scaled by the retinal density equations, specifically rA and σ. The value of R is unimportant as long as it is large enough to represent input patterns of width σ faithfully. The minimum such R can be computed based on the Nyquist theorem in digital signal processing theory (e.g. Cover and Thomas 1991): The sampling frequency, determined by R, has to be at least twice the spatial frequency of the input, which is determined by σ.
If ON and OFF channels are included in the model, σ is important only if it is larger than the centers of the LGN cells, i.e. only if the input is large enough so that
it will be represented as a pattern at the LGN output. As a result, with natural images or other stimuli with high-frequency information (i.e. small σ), the receptive field size of the LGN cells may instead be the limiting factor.
In practice, these results show that a modeler can simply use the smallest R that faithfully samples the input patterns, thereby saving computation time without significantly affecting how the map develops.
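The density-scaling rules above can be sketched in Python (a hypothetical helper; the input values below are illustrative, with R, σa, and σb chosen to match Figure 15.2 and the original afferent radius assumed):

```python
def scale_retinal_density(R_o, rA_o, alphaA_o, sigma_a_o, sigma_b_o, R):
    """Scale LISSOM parameters for a new retinal density R (Equation 15.2).
    The afferent radius rA and the Gaussian widths sigma_a, sigma_b scale
    with R, so each neuron keeps seeing the same portion of visual space
    and input size stays constant as a proportion of the retina. The
    afferent learning rate scales inversely with the number of afferent
    connections per neuron, i.e. with the area enclosed by the radius."""
    s = R / R_o
    rA = s * rA_o
    alphaA = (rA_o**2 / rA**2) * alphaA_o
    sigma_a = s * sigma_a_o
    sigma_b = s * sigma_b_o
    return rA, alphaA, sigma_a, sigma_b

# Doubling the density of the Figure 15.2 retina (assumed rA_o = 6):
# scale_retinal_density(24, 6.0, 0.1, 7.0, 1.5, 48) scales rA to 12.0,
# sigma_a to 14.0, sigma_b to 3.0, and alphaA down by a factor of 4.
```

In keeping with the Nyquist argument above, R itself can then simply be set to the smallest value that still samples patterns of width σ faithfully.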
15.2.3 Scaling Cortical Density
Because of the numerous lateral connections within the visual cortex, cortical density has the largest effect on the computation time and memory in LISSOM. With scaling equations, it is possible to develop the model through a series of computationally efficient low-density simulations, only scaling up to high density to see the details in the final model. Such scaling is also crucial for simulating maps with multiple feature dimensions, such as ocular dominance, orientation, and direction (Section 5.6), because such maps can be seen more clearly at higher cortical densities.
The equations for changing cortical density are analogous to those for retinal receptor density, with the additional requirement that the intracortical connectivity and associated learning rates must be scaled as well. The lateral connection radii rE and rI should be adjusted so that their ratios with N remain constant. If the lateral excitatory radius is decreased during the simulation, the final radius rEf must also be adjusted accordingly. Like αA in the previous section, αE and αI must be scaled so that the average total weight change per neuron remains constant at each iteration despite changes in the number of connections. Finally, the absolute weight level wd below which lateral inhibitory connections are deleted must be scaled inversely with the total number of such connections, because normalization adjusts each weight inversely proportional to the number of connections (Equation 4.8). In the continuous plane, that number is the area enclosed by the lateral inhibitory radius. Thus, the cortical density scaling equations are
rE = (N/No) rEo,  αE = (rEo²/rE²) αEo,  wd = (rIo²/rI²) wdo,
rI = (N/No) rIo,  αI = (rIo²/rI²) αIo.  (15.3)
Figure 15.3 shows how these equations can be used to generate closely matching orientation maps with different cortical densities. Larger maps are smoother and show more detail, but the overall structure is very similar.
The Nyquist theorem specifies theoretical limits on the minimum N necessary to faithfully represent a given orientation map pattern. In practice, however, the minimum excitatory radius rEf is the limiting parameter. For instance, the map pattern from Figure 15.3e can be reduced using image manipulation software to 18 × 18 without changing the global pattern of orientation patches. Yet, when simulated in LISSOM, the 36 × 36 (and to some extent, even the 48 × 48) map differs from the larger ones. These differences result from quantization effects on rEf. Because
(a) 36 × 36: 0.17 hours, 2.0 MB. (b) 48 × 48: 0.32 hours, 5.2 MB. (c) 72 × 72: 0.77 hours, 22 MB. (d) 96 × 96: 1.73 hours, 65 MB. (e) 144 × 144: 5.13 hours, 317 MB.
Fig. 15.3. Scaling cortical density. Five LISSOM orientation maps from networks with different densities are shown. The parameters for each network were first calculated using Equation 15.3, and each network was then trained independently on the same random stream of input patterns. The number of connections in these networks ranged from 2 × 106 to 3 × 108 (requiring 2 MB to 317 MB of memory), and the simulation time from 10 minutes to 5 hours. Despite this wide range of simulation scales, the final organized maps are both qualitatively and quantitatively similar, as long as their size is above a certain minimum (about 64 × 64 in this case). Larger networks take significantly more memory and simulation time, but offer greater detail and allow multiple dimensions such as orientation, ocular dominance, and direction selectivity to be represented simultaneously.
units are laid out on a rectangular grid, the smallest radius that includes at least one other neuron is 1.0. Yet, for small enough N, the scaled rEf will be less than 1.0. If such small radii are truncated to zero, the map will no longer have local topographic ordering, because there will be no local excitation between neurons. On the other hand, if the radius is held at 1.0 while the map continues to shrink, lateral excitation will take over a larger and larger portion of the map, making the orientation patches in the resulting map wider. Thus, in practice, N should not be reduced so far that rEf < 1.0.
Together, the area and density scaling equations allow essentially any size V1 and retina to be simulated without a search for the appropriate parameters. Given fixed resources, such as a computer of a certain speed with a certain amount of memory, they make it simple to trade off density for area, depending on the phenomena being studied. The equations are all linear, so they can also be applied together to change both area and density simultaneously. Such scaling makes it easy to utilize supercomputers for very large simulations. A small-scale simulation can be first developed with standard hardware, and then scaled up to study specific large-scale phenomena on a supercomputer. Scaling can also be done step by step while the network is self-organizing, thereby maximizing the size of networks that can be simulated, as will be described in the next section.
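The cortical density rules, including the practical rEf >= 1.0 constraint discussed above, can be sketched as follows (a hypothetical helper; the example parameter values are illustrative, not taken from the book's simulations):

```python
def scale_cortical_density(N_o, rE_o, rI_o, rEf_o, alphaE_o, alphaI_o,
                           wd_o, N):
    """Scale LISSOM parameters for a new cortical density N (Equation 15.3).
    Lateral radii (including the final excitatory radius rEf) scale with N;
    the lateral learning rates scale inversely with the number of lateral
    connections (the area enclosed by the radius); the pruning threshold
    wd scales inversely with the number of inhibitory connections."""
    s = N / N_o
    rE, rI, rEf = s * rE_o, s * rI_o, s * rEf_o
    alphaE = (rE_o**2 / rE**2) * alphaE_o
    alphaI = (rI_o**2 / rI**2) * alphaI_o
    wd = (rI_o**2 / rI**2) * wd_o
    if rEf < 1.0:
        # On a rectangular grid the smallest radius that reaches another
        # neuron is 1.0; below that, local excitation (and with it local
        # topographic ordering) would be lost.
        raise ValueError("N too small: scaled final excitatory radius < 1.0")
    return rE, rI, rEf, alphaE, alphaI, wd
```

Halving N halves every radius and quadruples the learning rates and pruning threshold, since the number of lateral connections per neuron drops by a factor of four.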
15.3 Forming Large Maps: The GLISSOM Approach
The scaling equations make it possible to determine the parameter settings necessary to perform a large-scale simulation. This approach can be generalized by applying the equations successively to larger and larger maps. The resulting method, called GLISSOM (growing LISSOM), allows scaling up LISSOM simulations much further, up to the size of the human V1.
The main idea in GLISSOM is to make use of the structure learned by the smaller network in each scaling step. Instead of self-organizing the scaled network from scratch, its initial afferent and lateral weights are interpolated from the weights of the smaller network. Such scaling allows neuron density to be increased while keeping the large-scale structural and functional properties constant, such as the organization of the orientation map. In essence, the large network is grown in place, thereby minimizing the computational resources required for the simulation.
GLISSOM is effective for two reasons. First, pruning-based self-organizing models such as LISSOM have peak computational and memory requirements at the beginning of training (Figure 15.4). At that time, all connections are active, none of the neurons are selective, and activity is spread over a wide area. As the neurons become selective and smaller regions of V1 are activated by a given input, simulation time decreases dramatically, because only the active neurons need to be simulated in a given iteration. GLISSOM takes advantage of this process by approximating the map with a very small network early in training, then gradually growing the map as selectivity and specific connectivity are established.
Second, self-organization in computational models, as well as in biology (Chapman et al. 1996), tends to proceed in a global-to-local fashion, with large-scale order established first, followed by more detailed local organization. Thus, small maps, which are faster to simulate and take less memory, can be employed first to establish global order, and large maps subsequently to achieve more detailed structure. In this manner, much larger networks can be simulated in a given computation time and in a given amount of memory.
Although the primary motivation for GLISSOM is computational, the scaling process is also well motivated biologically. It is an abstraction of how new neurons are integrated into an existing region during development. Recent experimental results suggest that new neurons continue to be added even in adulthood in many areas of primate cortex (Gould, Reeves, Graziano, and Gross 1999). Moreover, many of the neurons in the immature cortex (corresponding to GLISSOM's early stages) have not yet begun to make functional connections, having only recently migrated to their final positions (Purves 1988). Thus, the scale-up procedure in GLISSOM corresponds to the gradual process of incorporating those neurons into the partially organized map.
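The overall growing procedure can be summarized as a schematic outer loop (a sketch only: train_step and scale_network are stubs standing in for one LISSOM iteration and the interpolation-based scale-up of Section 15.4, and the intermediate map sizes in the schedule are illustrative; only the scaling iterations 4000, 6500, 12,000, and 16,000 come from the text):

```python
def scale_network(network, new_N):
    # Stub: in GLISSOM proper, parameters for the denser map come from the
    # cortical density scaling equations, and its initial weights are
    # interpolated from the smaller map's weights (Section 15.4).
    return {"N": new_N, "weights": network["weights"]}

def train_step(network):
    pass  # stub for one LISSOM self-organization iteration

def glissom(network, n_iterations, scale_schedule):
    """Schematic GLISSOM outer loop: train a small map and, at scheduled
    iterations, replace it with a denser map grown in place, so that peak
    time and memory stay near the small-network level."""
    for t in range(n_iterations):
        if t in scale_schedule:
            network = scale_network(network, new_N=scale_schedule[t])
        train_step(network)
    return network

# Illustrative schedule: grow from 36 x 36 to 144 x 144 at the scaling
# iterations used in Figure 15.4 (intermediate sizes assumed).
schedule = {4000: 54, 6500: 72, 12000: 96, 16000: 144}
net = glissom({"N": 36, "weights": None}, 20000, schedule)
# net["N"] is now 144
```

The loop makes the efficiency argument concrete: the expensive early iterations run on the smallest map, and the full-size map exists only for the final, cheap, selective iterations.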
15.4 GLISSOM Scaling
GLISSOM is based on the cortical density scaling equations, with one significant extension: The initial weights of the scaled network are calculated from the existing weights through interpolation. The interpolation equations are presented in this section, followed by experiments that demonstrate that the method is effective.

[Plots: seconds per iteration (a) and number of connections in millions (b), as a function of iteration from 0 to 20,000, for LISSOM and GLISSOM.]
(a) Training time per iteration (b) Memory usage per iteration
Fig. 15.4. Training time and memory usage in LISSOM vs. GLISSOM. Data are shown for a LISSOM network of 144 × 144 units and a GLISSOM network grown from 36 × 36 to 144 × 144 units as described in Section 15.4.2. (a) Each line shows a 20-point running average of the time spent in training for one iteration, with a data point measured every 10 iterations. Only training time is shown; times for initialization, plotting images, pruning, and scaling networks are not included. Computational requirements of LISSOM peak at the early iterations, falling as the excitatory radius (and thus the number of neurons activated by a given pattern) shrinks and as the neurons become more selective. In contrast, GLISSOM requires little computation time until the final iterations. Because the total training time is determined by the area under each curve, GLISSOM is much more efficient to train overall. (b) Each line shows the number of connections simulated at a given iteration. LISSOM's memory usage peaks at early iterations, decreasing at first in a series of small drops as the lateral excitatory radius shrinks, and then later in a few large drops as long-range inhibitory weights are pruned at iterations 6500, 12,000, and 16,000. Similar shrinking and pruning take place in GLISSOM, while the network size is scaled up at iterations 4000, 6500, 12,000, and 16,000. Because the GLISSOM map starts out small, memory usage peaks much later, and remains bounded because connections are pruned as the network is grown. As a result, the peak number of connections (which determines the memory usage) in GLISSOM is as low as the smallest number of connections in LISSOM.
15.4.1 Weight Interpolation Algorithm
In order to perform interpolation, the original weight matrices are treated as discrete samples of a smooth, continuous function. Under such an interpretation, the underlying smooth function can be resampled at a higher density. The resampling is equivalent to the smooth bitmap scaling done by computer graphics programs (as will be shown in Figure 15.6). This type of scaling always increases the size of the network by at least one whole row or column at once. However, unlike the growing SOM algorithms that add nodes to the original network (Bauer and Villmann 1997; Blackmore and Miikkulainen 1995; Cho 1997; Fritzke 1994, 1995; Jockusch 1990; Rodrigues and Almeida 1990; Suenaga and Ishikawa 2000), in GLISSOM the original network is completely replaced by the scaled network. The interpolation process is similar to that used in continuous SOM algorithms (Campos and Carpenter 2000; Göppert and Rosenstiel 1997). However, whereas continuous SOM methods interpolate to approximate functions more accurately, in GLISSOM the result forms a starting point for further self-organization.

Fig. 15.5. Weight interpolation in GLISSOM. This example shows a V1 of size 4 × 4 being scaled to 7 × 7, with a fixed 8 × 8 retina. Both V1 networks are plotted in a continuous two-dimensional area representing the surface of the cortex. The squares in V1 represent neurons in the original network (i.e. before scaling) and circles represent neurons in the new, scaled network. A is a retinal receptor cell and B and C are neurons in the new network. Afferent connection strengths to neuron B in the new network are calculated based on the connection strengths of the ancestors of B, i.e. those neurons in the original network that surround the position of B (B1, B2, B3, and B4 in this case). The new afferent connection strength wAB from receptor A to B is a normalized combination of the connection strengths wABi from A to each ancestor Bi of B, weighted inversely by the distance d(B, Bi) between Bi and B. Lateral connection strengths from C to B are calculated similarly, as a proximity-weighted combination of the connection strengths between the ancestors of those neurons. Thus, the connection strengths in the scaled network consist of proximity-weighted combinations of the connection strengths in the original network.
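As a rough stand-in for this proximity-weighted scheme, the resampling can be sketched as plain bilinear interpolation over a square weight matrix (a hypothetical helper; the real GLISSOM interpolation is a normalized combination over up to four ancestors, which bilinear interpolation implements for a uniform grid; new_size is assumed to be at least 2):

```python
def resample_bilinear(w, new_size):
    """Resample a square weight matrix w (list of lists) to new_size x
    new_size: each new unit's weight is a distance-weighted (bilinear)
    combination of the weights of the up-to-four surrounding 'ancestor'
    units in the original grid, as in Figure 15.5."""
    old = len(w)
    out = [[0.0] * new_size for _ in range(new_size)]
    for i in range(new_size):
        for j in range(new_size):
            # Position of the new unit in the original grid's coordinates.
            y = i * (old - 1) / (new_size - 1)
            x = j * (old - 1) / (new_size - 1)
            y0, x0 = int(y), int(x)
            y1, x1 = min(y0 + 1, old - 1), min(x0 + 1, old - 1)
            fy, fx = y - y0, x - x0
            out[i][j] = (w[y0][x0] * (1 - fy) * (1 - fx)
                         + w[y0][x1] * (1 - fy) * fx
                         + w[y1][x0] * fy * (1 - fx)
                         + w[y1][x1] * fy * fx)
    return out

# Growing a 2x2 weight matrix to 3x3 interpolates the midpoints:
# resample_bilinear([[0, 1], [2, 3]], 3)
#   -> [[0, 0.5, 1], [1, 1.5, 2], [2, 2.5, 3]]
```

Units that coincide with original grid positions keep their old weights exactly, so the large-scale structure of the map is preserved while the new units start from a smooth interpolation.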
Let us first derive the interpolation procedure for the afferent connections. Assume the original and the scaled networks are overlaid uniformly on the same two-dimensional area, as shown in Figure 15.5. The afferent connection weight from retinal receptor A to neuron B in the scaled network is calculated based on the cor-