
3 Design of Experiments

The experiments presented in this paper were conducted in the woods series of environments. These are grid worlds in which each cell can contain a tree (a T symbol), contain food (an F symbol), or be empty. An animat placed in the environment must learn to reach food cells. The animat senses the environment through eight sensors, one for each adjacent cell, and can move into any of the adjacent cells. If the destination cell contains a tree, the move does not take place. If the destination cell is blank, the move takes place. Finally, if the cell contains food, the animat moves, eats the food, and receives a constant reward. Each sensor is represented by two bits: 10 indicates a tree (T), 11 indicates food (F), and 00 indicates an empty cell. Classifier conditions are therefore 16 bits long (2 bits x 8 cells), while the eight actions are represented with three bits.
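The encoding described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text does not specify the order in which the eight sensors are read, so a clockwise order starting from north is assumed here; the "." symbol for an empty cell and the wrap-around at the grid edges are likewise assumptions, and the names sense, action_bits and matches are hypothetical.

```python
# Two-bit codes taken from the text: tree = 10, food = 11, empty = 00.
# Using "." for an empty cell is an assumption of this sketch.
SENSOR_CODES = {"T": "10", "F": "11", ".": "00"}

# Assumed sensor order: clockwise starting from north (not stated in the text).
NEIGHBOR_OFFSETS = [(-1, 0), (-1, 1), (0, 1), (1, 1),
                    (1, 0), (1, -1), (0, -1), (-1, -1)]


def sense(grid, row, col):
    """Return the 16-bit sensor string for an animat at (row, col)."""
    height, width = len(grid), len(grid[0])
    bits = []
    for d_row, d_col in NEIGHBOR_OFFSETS:
        # Assumption: the grid wraps around at its edges.
        cell = grid[(row + d_row) % height][(col + d_col) % width]
        bits.append(SENSOR_CODES[cell])
    return "".join(bits)


def action_bits(action):
    """Encode one of the eight moves (0..7) as a three-bit string."""
    return format(action, "03b")


def matches(condition, sensors):
    """A condition over {0, 1, #} matches if every non-# symbol agrees."""
    return len(condition) == len(sensors) and all(
        c == "#" or c == s for c, s in zip(condition, sensors)
    )


grid = ["TF.",
        "...",
        "..T"]
print(sense(grid, 1, 1))  # 16-bit string, two bits per adjacent cell
```

With the animat in the centre of this 3 x 3 grid, the food cell directly to the north contributes the leading 11, and the two trees contribute 10 at their positions in the assumed sensor order.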

Each experiment consists of a number of problems that the animat must solve. For each problem, the animat is randomly placed in a blank cell of the environment; then it moves under the control of the system until it enters a food cell, eats the food, and receives a constant reward. The food immediately re-grows and a new problem begins. We employed the following exploration/exploitation strategy (Wilson, 1995; Wilson, 1996): before a new problem begins, the animat decides with probability 0.5 whether it will solve the problem in exploration or exploitation.

We employed two different exploration strategies: random exploration and biased exploration. In random exploration, the system selects an action at random from those in the match set. In biased exploration, the system decides with a probability P_s whether to select an action randomly or to choose the action which predicts the highest payoff (a typical value for P_s is 0.5). In exploitation, the animat always selects the action which predicts the highest payoff and the GA does not act. In order to evaluate the final solutions evolved, exploration is turned off in each experiment during the last 1000 problems and the system works in exploitation only. The performance of XCS is computed as the average number of steps to food in the last 50 exploitation problems. Every statistic presented in this paper is averaged over ten experiments.
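The selection rules and the performance measure described above might be sketched as follows. This is a minimal illustration, not the original implementation: the function names are hypothetical, the match set is reduced to a dictionary mapping each action to its predicted payoff, and P_s is assumed to be the probability of the random choice (the text leaves open which of the two options P_s refers to).

```python
import random


def choose_mode(rng):
    """Before each problem: explore or exploit, each with probability 0.5."""
    return "explore" if rng.random() < 0.5 else "exploit"


def select_action(predictions, mode, rng, biased=False, p_s=0.5):
    """Pick an action from the match set.

    predictions maps each action in the match set to its predicted payoff.
    Assumption: p_s is the probability of choosing randomly under biased
    exploration; otherwise the highest-payoff action is taken.
    """
    best = max(predictions, key=predictions.get)
    if mode == "exploit":
        return best
    if biased and rng.random() >= p_s:
        return best
    # Random exploration, or the random branch of biased exploration.
    return rng.choice(sorted(predictions))


def performance(steps_per_exploit_problem, window=50):
    """Average steps to food over the last `window` exploitation problems."""
    recent = steps_per_exploit_problem[-window:]
    return sum(recent) / len(recent)


rng = random.Random(0)
predictions = {0: 10.0, 3: 50.0, 5: 20.0}
print(select_action(predictions, "exploit", rng))  # always the best action
print(select_action(predictions, "explore", rng, biased=True))
```

Under this reading, setting p_s to 1.0 reduces biased exploration to random exploration, while p_s set to 0.0 makes it behave like exploitation (apart from the GA, which is outside the scope of this sketch).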

4 XCS in Maze5 and Maze6

The first results reported in the literature for XCS by Wilson (1995) are limited to two regular and aperiodic environments, Woods1 and Woods2, in which the optimal solution requires only a few steps to reach a food position. In these environments the optimal solution can be described by a small number of very general classifiers; roughly speaking, we say that these environments permit many generalizations. These initial experiments were extended by Lanzi (1997) to a more challenging environment, Maze4, in which the optimal solution requires longer sequences of actions to reach the goal, and the environment permits only a few generalizations. The author observed that in difficult sequential problems the system performance can fail dramatically. It was argued that this happens because in particularly difficult situations, characterized by long sequences of actions and only a few admissible generalizations, the generalization mechanism of XCS can be too slow to eliminate overly general classifiers before they proliferate in the population, causing a significant decrease in the system performance (briefly, we say that overly general classifiers corrupt the population) (Lanzi, 1997b). The specify operator was thus introduced in order to help XCS recover from overly general classifiers.

Wilson (1997) suggested that another important factor underlying what was observed in Lanzi (1997) is the amount of random exploration the agent performs. Accordingly, he proposed a different solution in which this amount is reduced by replacing random exploration, employed in the first work on XCS, with biased exploration. Wilson (1997) also suggested that the behavior discussed in Lanzi (1997) may occur when no classifier in the action set is very accurate. When this occurs, the classifier fitness calculation, which estimates the accuracy of each classifier relative to the action set, gives all of them substantial fitness, producing inappropriate results. Specify detects such conditions because it is activated by the error parameter and not by the accuracy. Thus it is able to recover from this type of situation by eliminating the source of inaccuracy in the action set.

We now extend previous results presented in the literature by comparing the two solutions in two new Markovian (i.e., all the states are distinguishable) environments: Maze5 and Maze6 (Figure 1 (a) and Figure 1 (b)). We compare four algorithms for each environment: (i) XCS according to the original definition, that is, without subsumption deletion; (ii) XCS without don't care symbols (#s are not introduced in the initial population, during covering, or during mutation); (iii) XCS with specify, referred to here as XCSS; (iv) XCS with biased exploration.

Notice that the performances of algorithms (i) and (ii) are two important references. The former indicates what the original system can do when the generalization mechanism is in operation, while the latter defines the potential capabilities of XCS without generalization operating. Before proceeding, we wish to point out that the results presented are not intended to indicate which strategy is best for solving the proposed problems. Our aim is to analyze more general phenomena which can be easily studied in simple environments but can be difficult to examine in more complex environments, where other settings may not work.
