
gene {0,1} indicates whether or not the ith feature from the overall feature set is included in the specified feature subset. Thus, each individual in a GA population consists of a fixed-length binary string representing some subset of the given feature set. The advantage of this representation is that a standard, well-understood GA can be used without any modification.
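As a minimal illustration of this encoding (a Python sketch; the variable names are ours, not the dissertation's), a feature subset is just a binary mask over the feature indices:

```python
import numpy as np

# Each GA individual is a fixed-length binary string; bit i says whether
# feature i belongs to the candidate subset.
n_features = 105                                   # size of the overall feature set
rng = np.random.default_rng(0)

individual = rng.integers(0, 2, size=n_features)   # e.g. [1 0 1 1 0 ...]
subset = np.flatnonzero(individual)                # indices of selected features

print(f"{individual.sum()} of {n_features} features selected")
```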

Each member of the current GA population represents a competing feature subset that must be evaluated to provide fitness feedback to the evolutionary process. This is achieved by invoking C4.5 with the specified feature subset and a set of training data (reduced to include only the feature values of the specified features). The decision tree is then tested for classification accuracy on a set of unseen evaluation data. Its accuracy together with the size of the feature subset is used as the GA fitness measure.

Fitness functions for feature subsets based on information content, orthogonality, and the like can nevertheless provide baseline feedback to the stochastic search methods characteristic of evolutionary computation (EC), resulting in enhanced performance. As an example, GAs can search the space of all possible subsets of a large set of candidate discrimination features, capture important non-linear interactions, and thus improve on the quality of the feature subsets produced by ranking methods alone. The effectiveness of the selected features on the actual pattern recognition task can then be assessed by learning appropriate classifiers and measuring their observed performance. Learning has the ability to smooth the fitness landscape, thus facilitating evolution, and eventually what is learned becomes clamped ('phenotype rigidity'). Fig. 5.3 shows how the interplay between learning and evolution takes place in the context of pattern recognition.

Feature subsets selected by the GA module for evaluation are used by the filter sub-module ('Feature Mask') to preprocess the training data and describe it only in terms of those features. The filter sub-module also divides the learning examples into training and testing data, respectively. The training data is used by the learning sub-module, while the testing data is used by the evaluation module to assess the (local) fitness of what has been learned so far. As the number of training examples increases, there is more opportunity for learning and an increased potential effect on evolution. Too much learning ('overfitting'), however, can result in loss of generality and negatively affect the overall robustness of the results.
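The evaluation loop just described can be sketched as follows, with scikit-learn's CART learner standing in for C4.5 (the 40/60 split mirrors the experimental setup reported below; the size-penalty weight is an assumed value):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def fitness(individual, X, y, size_penalty=0.01):
    """Evaluate one GA individual: mask the features, induce a tree on the
    training part, score it on held-out data, and trade accuracy off
    against subset size. The penalty weight is an assumed value."""
    mask = individual.astype(bool)
    if not mask.any():                       # empty subsets get worst fitness
        return 0.0
    X_masked = X[:, mask]                    # the 'Feature Mask' filter step
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_masked, y, train_size=0.4, random_state=0)
    tree = DecisionTreeClassifier().fit(X_tr, y_tr)   # stand-in for C4.5
    accuracy = tree.score(X_te, y_te)
    return accuracy - size_penalty * mask.sum() / mask.size
```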

[Figure: the GA Module receives the Full Feature Set and a Fitness Measure and emits candidate Feature Subsets; the Evaluation Module applies a Feature Mask to the examples (obtained from Image Data via Feature Extraction), splits them into Training and Testing Data, performs Classifier Learning and Classifier Evaluation, and returns a FOM (Figure-of-Merit) combining classifier evaluation and Feature Set Size; the output is the Optimal Feature Subset.]

Figure 5.3. Eye Recognition: Symbiotic Adaptation for Pattern Classification Using Learning and Evolution

Our belief is that such a hybrid learning system will identify significantly better feature subsets than those produced by existing methods for two reasons. First, we are exploiting the power of GAs to efficiently explore the non-linear interactions of a given set of features. Second, by using C4.5 in the evaluation loop, we have an efficient mechanism for directly measuring classification accuracy.

In order to test our ideas we have implemented a prototype version of the system. For the GA component, we used without modification GENESIS (Grefenstette, 1991), a standard GA implementation. Similarly, we used without modification C4.5, a standard decision tree implementation (Quinlan, 1986), to build the decision trees for the evaluation procedure. For both components, standard default parameter settings from the literature were used. For the GA module, this resulted in a constant population size of 50, a crossover rate of 0.6, and a mutation rate of 0.001. For C4.5, the pruning confidence level was set to the default of 25%.
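A minimal generational GA with these parameter settings might look as follows (a sketch, not GENESIS itself; the selection and replacement schemes are assumptions):

```python
import numpy as np

POP_SIZE, P_CROSS, P_MUT, N_GENES = 50, 0.6, 0.001, 105
rng = np.random.default_rng(1)

def evolve(fitness_fn, n_generations=20):
    """Minimal generational GA with the quoted parameter settings."""
    pop = rng.integers(0, 2, size=(POP_SIZE, N_GENES))
    for _ in range(n_generations):
        fit = np.array([fitness_fn(ind) for ind in pop])
        probs = fit - fit.min() + 1e-9          # fitness-proportionate
        probs = probs / probs.sum()             # selection (an assumption)
        pop = pop[rng.choice(POP_SIZE, size=POP_SIZE, p=probs)]
        for i in range(0, POP_SIZE - 1, 2):     # one-point crossover
            if rng.random() < P_CROSS:
                cut = rng.integers(1, N_GENES)
                pop[[i, i + 1], cut:] = pop[[i + 1, i], cut:]
        flip = rng.random(pop.shape) < P_MUT    # bit-flip mutation
        pop = np.where(flip, 1 - pop, pop)
    return pop
```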

5.4 Experimental Results

The ability to detect salient facial features is an important component of any face recognition system. Eye detection plays an important role in face normalization and thus facilitates further localization of facial landmarks. It allows one to focus attention on salient facial configurations, to filter out structural noise, and to achieve eventual face recognition. For these purposes, we first applied the hybrid learning system to perform feature selection for facial landmark classification.

In our experiments, 102 eye, 102 nose, and 102 other facial region examples are available for learning, and 52 eye, 52 nose, and 52 other facial region examples are used as test data. Each eye, nose, or other-region image has a resolution of 16 by 12 (columns by rows) pixels and is cut from human face images (64x72 pixels). 105 features are produced for each example. The configuration of these features follows the rules listed in Table 5.2, while Table 5.3 gives the characteristics of the face data.

Features        Description
x1 to x35       Means of a 4x4 window with 50% (2-pixel) overlap in each shift.
x36 to x70      Standard deviations of a 4x4 window.
x71 to x105     Entropies of a 4x4 window.

Table 5.2 Configuration of 105 features

Characteristics       Description
Number of Examples    Learning set of 102 examples and test set of 52 examples.
Number of Features    105 features.
Feature Value         Numerical, in the range 0 to 255.
Number of Classes     3 decision classes, 1 to 3 (eye, nose, and others).

Table 5.3 Characteristics of the Facial Data
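The window arithmetic can be checked directly: a 16x12 patch scanned by a 4x4 window with a 2-pixel stride (50% overlap) yields 7x5 = 35 window positions, and three statistics per window give the 105 features. A sketch of this extraction (the 256-bin histogram estimate of entropy is our assumption):

```python
import numpy as np

def window_features(patch):
    """patch: 12x16 (rows x cols) region with pixel values in 0..255.
    Returns 105 features: 35 means, 35 standard deviations, 35 entropies,
    one triple per 4x4 window shifted by 2 pixels (50% overlap)."""
    means, stds, ents = [], [], []
    for r in range(0, patch.shape[0] - 3, 2):        # 5 row positions
        for c in range(0, patch.shape[1] - 3, 2):    # 7 column positions
            w = patch[r:r+4, c:c+4].astype(float)
            means.append(w.mean())
            stds.append(w.std())
            p = np.bincount(w.astype(int).ravel(), minlength=256) / w.size
            p = p[p > 0]                              # drop empty bins
            ents.append(-(p * np.log2(p)).sum())      # Shannon entropy
    return np.array(means + stds + ents)              # length 105

patch = np.random.default_rng(2).integers(0, 256, size=(12, 16))
assert window_features(patch).shape == (105,)
```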

An initial set of experiments has been performed to assess the performance of the hybrid GA-DT learning system. Subsets of optimal features for recognizing visual concepts in facial image data have been learned and compared against standard methods for feature selection. The error rates on unseen image data and the tree description complexity (measured as the number of nodes) have been used as the basis for comparison.

In order to apply cross validation to the learning process, the learning data was randomly shuffled to generate two sets (Fig. 5.4). In each set, 40% of the examples were used for inducing decision trees and the other 60% for evaluating the learned description. Ten experiments were performed on each set in order to find the best average performance. The set that produced the best result, i.e., the lowest error rate, was selected as the training set. The tree generated during the best run on that set (one of ten runs) was then applied to the unseen test data.
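In outline, this protocol amounts to the following sketch (run_ga_dt is an assumed stand-in for one GA-DT run that returns an error rate on the evaluation indices):

```python
import numpy as np

def select_training_set(n_examples, run_ga_dt, n_runs=10, seed=0):
    """Sketch of the Fig. 5.4 protocol: shuffle, split into two sets,
    divide each 40/60 into training/evaluation, run GA-DT ten times per
    set with different seeds, keep the set with the lowest average error
    and remember its best run."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_examples)          # random shuffle
    best = None
    for half in np.array_split(order, 2):        # the two data sets
        cut = int(0.4 * len(half))               # 40% train / 60% evaluate
        train, evaluate = half[:cut], half[cut:]
        errors = [run_ga_dt(train, evaluate, seed=s) for s in range(n_runs)]
        if best is None or np.mean(errors) < best[0]:
            best = (float(np.mean(errors)), train, int(np.argmin(errors)))
    return best   # (average error, training indices, index of best run)
```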

Results obtained by the GA-DT system have been compared with two other sets of results. The first was obtained by using all the features (105 for the facial data). To generate the second, a feature set reduced to the same number as the one produced by the GA-DT experiment was used. This reduction was achieved by independently ranking each feature using an information-theoretic entropy measure (infomax) to estimate which features are the most discriminatory. Features that lead to the greatest reduction in the estimated entropy measure over the training examples are chosen. The exact criterion is to choose the feature vector X with values {xv1, xv2, ..., xvm} that minimizes an expression of the following (class-entropy) form over the n classes,

E(X) = \sum_{j=1}^{n} \sum_{i=1}^{m} \left[ -x_i^{+} \log_2 \frac{x_i^{+}}{x_i^{+} + x_i^{-}} - x_i^{-} \log_2 \frac{x_i^{-}}{x_i^{+} + x_i^{-}} \right]

where, for a given class, x_i^+ is the number of examples in that class with value xvi, and x_i^- is the number of negative examples (all the examples not belonging to this class) with the value xvi.
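A hedged sketch of this ranking criterion, matching the expression above (the function name is ours):

```python
import numpy as np

def infomax_score(feature_col, labels):
    """For each value of the feature, count positive (in-class) and
    negative (out-of-class) examples and accumulate the class entropy
    over all classes; lower scores mark more discriminatory features."""
    score = 0.0
    values = np.unique(feature_col)
    for cls in np.unique(labels):
        pos_mask = labels == cls
        for v in values:
            x_pos = np.sum(pos_mask & (feature_col == v))   # x_i^+
            x_neg = np.sum(~pos_mask & (feature_col == v))  # x_i^-
            for x in (x_pos, x_neg):
                if x > 0:
                    score -= x * np.log2(x / (x_pos + x_neg))
    return score

# Rank features ascending by score and keep the best k (here k = 41):
# ranking = np.argsort([infomax_score(X[:, j], y) for j in range(X.shape[1])])[:41]
```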

[Figure: the Learning Data is randomly split into two data sets; each is divided into 40% Training Data and 60% Evaluation Data; 10 runs of GA-ID3 with different seeds for the random generator are performed on each set; the best data set is selected based on its average performance over the 10 runs, the best run for that set is selected, and the resulting Final Tree is applied to the Unseen Testing Data to produce the Results.]

Figure 5.4: Setup for GA-DT Experiments

Fig 5.5 shows average performance on the evaluation sets over 10 runs using the GA-DT system. Table 5.4 shows the results of the GA-DT experiment together with the corresponding performance for (i) the set with all the features and (ii) the set with features reduced to the number obtained in the GA-DT experiment and ranked using the (entropy) infomax measure. The results show that an improvement in recognition rate (a lower error rate) was obtained for the feature sets reduced by the GA-DT system; the number of features was reduced by about 60% for the facial data.

[Plot: average error rate (%) on the evaluation sets versus number of trials (50 to 950) for the 1st and 2nd data sets; the two curves end near 31.371% and 29.109%.]

Figure 5.5: Average Error Rate for 10 Runs of Face Data Sets 1, 2

Feature Set                                  Error Rate   Tree Complexity
All 105 features                             38.4%        8 nodes
41 features selected by the GA-DT system     27.5%        13 nodes
41 best features chosen by feature ranking   38.5%        11 nodes

Table 5.4 Experimental Results for Face Data


CHAPTER 6

Eye Detection

The ability to detect salient facial features is an important component of any face recognition system, in particular for normalization purposes and for the extraction of the image features to be used later on by the face classifiers. The detection of facial landmarks underlies attention mechanisms similar to those used by the human visual system (HVS) to screen the visual field and to focus attention on salient input characteristics. Among the many facial features available, it appears that the eyes play the most important role in automated visual interpretation and human face recognition. As both the position of the eyes and the interocular distance are relatively constant for most people, detecting the eyes first of all plays an important role in face normalization and thus facilitates further localization of facial landmarks. It is eye detection that allows one to focus attention on salient facial configurations, to filter out structural noise, and to achieve eventual face recognition.

6.1 Related Work

Yuille (1991) describes a complex but generic strategy, characteristic of the abstractive approach, for locating facial landmarks such as the eyes using the concept of a deformable template. The template is a parametrized geometric model of the face or part of it (mouth and/or eyes) to be recognized, together with a measure of how well it fits the image data, where variations in the parameters correspond to allowable deformations. Lam and Yan (1996) extended Yuille's method for extracting eye features by using corner locations inside the eye windows, which are obtained by means of average anthropometric measures after the head boundary is first located. Deformable templates and the closely related elastic snake models are interesting concepts, but using them can be difficult in terms of learning and implementation, not to mention their computational complexity. Eye features can also be detected using their symmetry characteristics. As an example, Yeshurun and his colleagues (1991) developed a generalized symmetry operator for eye detection. As another example, Ducottet et al. (1994) detect the position of objects with circular symmetry and variable size against a noisy background, using the orthonormal wavelet transform of an image and then the correlation product of the image with an adapted representation of the object. Most recently, an attempt has been made to speed up the optimization process involved in symmetry detection using evolutionary computation: Gofman et al. (1996) introduced a global optimization approach for detecting local reflectional symmetry in gray-level images using 2D Gabor decomposition via soft windowing in the frequency domain. Samaria (1994) employs stochastic modeling, using Hidden Markov Models (HMMs), to holistically encode face information. When frontal facial images are sampled using top-to-bottom scanning, the natural order in which the facial landmarks appear is encoded using an HMM. The HMM leads to efficient detection of eye strips but still leaves the task of homing in on the eyes unsolved.

A similar behavior-based approach for pattern classification and navigation tasks is suggested by the concept of visual routines (Ullman, 1984), recently referred to as a visual routine processor (VRP) by Horswill (1995). The VRP assumes the existence of a set of visual routines that can be applied to base image representations (maps), subject to specific functionalities, and driven by the task at hand. Recently, Johnson, Maes and Darrell (1994) suggested using Genetic Programming to evolve visual routines whose perceptual task is finding the left and right hands of a person in binary images.

Most of the eye detection methods available now lack size invariance, as they assume a prespecified face size or must operate at several grid sizes and perform an exhaustive search over the face images. Our novel strategy overcomes this drawback and achieves invariance through the evolution of 'navigation' routines, traffic collection, and consensus methods. We consider eye detection as a face recognition (routine) competency whose inner workings include screening the facial landscape to detect ('pop out') salient locations as possible eye locations, and labeling as eyes only those salient areas whose feature distribution 'fits' what the eye classifier has been trained to recognize. In other words, the first stage of the eye detection routine implements attention mechanisms whose goal is to create an 'eye' saliency map, while the second stage processes the facial landscape filtered by the just-created saliency map for actual eye recognition. Due to the complexity involved in eye detection, we introduce a novel strategy that expands on the abstractive approach using evolutionary techniques.

6.2 Eye Detection Architecture

The overall architecture, shown in Fig. 6.1, consists of two components whose tasks are eye saliency and eye recognition, respectively. The ('sensory-reactive') saliency component has to discover the most likely eye locations, while the ('model-based') eye recognition component probes the suggested locations ('candidates') for actual eye detection.

[Diagram: the Sensory & Reactive Level produces the Saliency Map ('Where'); the Model-Based Level performs Eye Recognition ('What').]

Figure 6.1. Eye Location Using a Two-Stage Where and What Architecture

This section describes in detail how the tasks of feature extraction (and data compression), evolution of visual routines for conspicuity (map) derivation, and the integration of the output from the visual routines for defining the saliency map, characteristic of collective behavior, are mapped into computational models. The goals of the novel architecture for eye detection are twofold: (i) derivation of the saliency attention map using consensus between navigation routines encoded as finite state automata (FSA) evolved using GAs, (ii) selection of optimal features for eye classification using GAs and induction of DT (decision trees) for classifying the most salient locations identified earlier as eye vs non-eye regions. Specifically, we describe what the image base representations are, how visual routines are encoded as FSA and evolved, and how consensus methods can integrate ('fuse') visual routine outputs during testing on unseen facial images.

6.2.1 Derivation of the Saliency Map

The derivation of the saliency map (see Fig. 6.2) is described in terms of the tasks involved, the computational models, and the corresponding functionalities. The tasks involved include feature extraction and data compression, conspicuity derivation, and integration of the outputs from several visual routines. The corresponding computational model has access to feature maps, while GAs are used to generate finite state automata (FSA) as appropriate policy (action) functions. Consensus methods achieve the integration of the visual routines, characteristic of distributed multiagent collective behavior, as the individual routines ('agents') navigate across the facial landscape, trying to filter out non-eye regions and home in on the most promising ('salient') eye locations.
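The FSA encoding is not spelled out at this point in the text, but schematically a navigation routine reduces to lookup tables that a GA can evolve; everything in the following sketch (table sizes, move alphabet, function names) is an illustrative assumption:

```python
import numpy as np

# Schematic FSA policy: the genome evolved by the GA would encode the
# transition and action tables; an observation is a discretized reading
# of the local feature map. All sizes below are assumed for illustration.
N_STATES, N_OBS, N_MOVES = 8, 4, 4          # assumed sizes

rng = np.random.default_rng(3)
transition = rng.integers(0, N_STATES, size=(N_STATES, N_OBS))
action = rng.integers(0, N_MOVES, size=(N_STATES, N_OBS))  # 0..3 = N/E/S/W

def fsa_step(state, observation):
    """One navigation step: read the local observation, emit a move,
    and change state."""
    return transition[state, observation], action[state, observation]

state = 0
for obs in [1, 3, 0, 2]:                     # a toy observation stream
    state, move = fsa_step(state, obs)
```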

[Diagram: tasks involved, computational models, and functionalities - feature extraction (statistics: mean, s.d., and entropy; navigation), conspicuity derivation (FSA and GA; conspicuity maps), and data integration (consensus; saliency map).]

Figure 6.2. Architecture of the Where Stage - the Saliency Map

6.2.2 Feature Selection and Eye Classification

The eye detection architecture for the derivation of base representations ('feature selection') and eye classification integrates genetic algorithms (GAs) and inductive generalization, characteristic of learning from examples, as described in Sect. 5.3. Fig. 5.3 illustrates our general approach. An initial but large list of low-level image features is provided, together with the images from which the feature vectors ('examples') for both positive (eye) and negative (non-eye) classes are extracted. The GA is used to explore the space of all subsets of the given feature list, where preference is given to those feature subsets which achieve the best detection performance using small-dimensionality feature sets ('Occam's razor'). Each of the selected feature subsets is evaluated - its fitness measured - using the induction of decision trees. The above process is iterated along evolutionary lines, and the best feature subset found is then recommended for use in the actual design of the corresponding facial landmark detection system. Note that the examples referred to in Fig. 5.3 are represented by all the features from the original list, while the training and tuning data consist of only those features passed through the feature subset controller as a result of natural selection.

The learning strategy, conceptually akin to cross-validation (CV) and bootstrapping, draws randomly from the original list of examples, both positive and negative, and generates two labeled sets consisting of training and tuning data. The training data is used to induce trees, while the tuning data is used to evaluate the trees and generate appropriate fitness measures. Once the evolution stops, the best tree is frozen and used for testing on unseen facial imagery to assess how well the visual routine performs on eye detection.

The decision tree (DT) induction process is applied to the training data and generates decision trees for each of the given classes. The evaluation procedure assesses the fitness of each feature subset, as given by its corresponding induced tree, using the tuning data sets and employing an integrated measure consisting of both the number of features used and the error rate. This measure of fitness is then passed to the Genetic Module stage to further evolve optimal candidate feature subsets. The whole process is initialized by randomly selecting subsets of features and making them available to DT induction. Note that detection of an eye is considered successful if the corresponding region is reached. If one is interested in detecting only the center of the landmark region, penalty factors for (i) multiple detections and (ii) the distance between the center of the landmark and the actual detection should also be included.

A description generated by a decision tree (DT) induction process is both descriptive and procedural. The descriptive representation is used to evaluate a given subset of features by measuring the error rate on the tuning data. The procedural representation describes the way in which a given subset of features should be applied (i.e., it describes not only the feature subset itself but also the order in which the features should be applied). Since the DT learning process determines the most discriminatory feature at each step of the recursive process of dichotomizing the training data, the procedural representation is inherently embedded in the decision tree. This procedural descriptiveness of the evaluation process is the main motivation for using decision trees. To implement the Genetic Module and the DT we use GENESIS (GENEtic Search Implementation System) and C4.5, respectively.
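To make the descriptive/procedural distinction concrete, the following sketch uses scikit-learn's CART learner as a stand-in for C4.5; printing the tree exposes the order in which the selected features are applied, root first:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2).fit(X, y)

# Descriptive use: the tree is scored as a classifier (here, for brevity,
# on its own training data; the dissertation scores held-out tuning data).
print("accuracy:", tree.score(X, y))

# Procedural use: the printed tree shows not only which features were
# selected but also the order in which they are applied, root first.
print(export_text(tree, feature_names=["sl", "sw", "pl", "pw"]))
```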

6.3 Experiments

This section describes in detail how the tasks of feature extraction (and data compression), evolution of visual routines for conspicuity (map) derivation, and the integration of the output from the visual routines for defining the saliency map, characteristic of collective behavior, are mapped into computational models. Two experiments were implemented to achieve the eye detection task. In the first experiment we use exhaustive search over the entire face image to find the eye location after the optimal eye feature subset has been found. The second experiment uses a new strategy that overcomes the drawback of exhaustive search and achieves invariance through the evolution of 'navigation' routines. Its goals are twofold: (i) derivation of the saliency attention map using consensus between navigation routines encoded as finite state automata (FSA) evolved using GAs, (ii) selection of optimal features for eye classification using GAs and induction of DT (decision trees) for classifying the most salient locations identified earlier as eye vs non-eye regions. Specifically, we describe what the image base representations are, how visual routines are encoded as FSA and evolved, and how consensus methods can integrate ('fuse') visual routine outputs during testing on unseen facial images. Experimental results, using 35 face images from the FERET database, are provided in a stepwise fashion as we describe our particular implementation and show the feasibility of our hybrid approach. Since large amounts of FERET images were acquired during different photo sessions, the lighting conditions and the size of the facial images can vary. The diversity of the FERET database spans gender, race, and age.

6.3.1 Eye Detection Using Exhaustive Search


This section describes the data used, the original list of possible features, and the detection architectures used to train and test the eye detection routine. The facial images are of 64x64 resolution - Fig. 6.3 shows examples of the facial images used in our experiments. The database consists of 120(+) eye examples. As training involves negative examples as well, the database also includes 480(-) non-eye examples drawn randomly across the facial landscape. The resolution for both the positive and negative eye examples is 24x16. The feature list for each 24x16 window initially consists of one hundred and forty-seven features {x1, x2, ..., x147}. The set of features, each of them measured over 6x4 windows with two pixels of overlap, is:

-x1 to x49: means for each window

-x50 to x98: entropies for each window

-x99 to x147: means for each window after applying the Laplacian over the 24x16 frames

Fig. 6.4 illustrates the feature extraction process for training and tuning examples.
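For the counts to come out, a 24x16 patch scanned by 6x4 windows must yield 7x7 = 49 positions per statistic (3 x 49 = 147), which implies a vertical shift of 2 pixels and a horizontal shift of 3; the sketch below (an illustration, not the dissertation's code) adopts those strides:

```python
import numpy as np
from scipy.ndimage import laplace

def eye_features(patch):
    """patch: 16x24 (rows x cols) region with values in 0..255.
    Returns 147 features: per 6x4 window, its mean, its entropy, and its
    mean after a Laplacian filter. Strides (2 rows, 3 cols) are chosen
    to yield a 7x7 grid of windows - an assumption."""
    lap = laplace(patch.astype(float))
    means, ents, lap_means = [], [], []
    for r in range(0, patch.shape[0] - 3, 2):        # 7 row positions
        for c in range(0, patch.shape[1] - 5, 3):    # 7 column positions
            w = patch[r:r+4, c:c+6]
            means.append(w.mean())
            p = np.bincount(w.astype(int).ravel(), minlength=256) / w.size
            p = p[p > 0]
            ents.append(-(p * np.log2(p)).sum())
            lap_means.append(lap[r:r+4, c:c+6].mean())
    return np.array(means + ents + lap_means)         # length 147

patch = np.random.default_rng(4).integers(0, 256, size=(16, 24))
assert eye_features(patch).shape == (147,)
```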

Once training is completed, exhaustive testing ('search') of the learned detection routine is performed across 58 unseen facial images. Testing takes place as 24x16 search windows are shifted by four pixels to cover the entire facial image. Each test image thus generates 143 examples that the eye detection routine is asked to decide on. The detection strategy includes cross validation (CV) and winner-take-all (WTA).
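The figure of 143 windows follows directly: with a 4-pixel shift, a 64x64 image admits (64-16)/4+1 = 13 vertical and (64-24)/4+1 = 11 horizontal positions for a 24x16 window, i.e. 13 x 11 = 143. A sketch of the scan (classify is an assumed wrapper around the induced tree; eye_features is the sketch above):

```python
def scan_image(image, classify):
    """Slide a 24x16 (cols x rows) window over a 64x64 image in 4-pixel
    steps and collect the locations the classifier labels as an eye."""
    hits = []
    for r in range(0, image.shape[0] - 16 + 1, 4):      # 13 row positions
        for c in range(0, image.shape[1] - 24 + 1, 4):  # 11 col positions
            window = image[r:r+16, c:c+24]
            if classify(eye_features(window)) == "eye":
                hits.append((r, c))
    return hits        # 143 windows examined per image
```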

Figure 6.3 Examples of Face Images

[Diagram: from a training facial image (64x64 pixels), 24x16-pixel windows are scanned with 6x4 sub-windows shifted a few pixels at a time; the mean, the entropy, and the mean after a Laplacian are computed per sub-window to generate the 147 features {x1, x2, ..., x147}.]

Figure 6.4. Feature Extraction for Decision Trees

The experiment was carried out to assess the feasibility of our hybrid genetic approach for eye detection, using the general architecture described in Sect. 6.2. Each generation keeps a constant population size of fifty individuals; the crossover rate and mutation rate are 0.6 and 0.001, respectively.

The set of six hundred examples, 120(+) eye examples and 480(-) non-eye examples, is divided into three equal subsets for cross-validation training and tuning, and a tournament consisting of three sets of CV rounds takes place. The corresponding error rates on the tuning data, including both false positives - false detections of eyes - and false negatives - missed eyes, are shown as fitness measures in Fig. 6.4. The feature subset corresponding to the tree derived from the third CV round, which achieved the smallest error rate - 4.87% - consists of only 60 of the original 147 features. This feature subset is used to evaluate the performance of eye detection on unseen test examples.

[Plot: error rate (%) on the tuning data versus number of trials (0 to 1000) for the 1st, 2nd, and 3rd CV rounds; the curves end at 6.26%, 4.90%, and 4.87%.]

Figure 6.4. Fitness Measures for Learning on Database.

Table 6.1 gives the average confusion matrix for the eye detection task. These statistics were obtained by visually inspecting the detection of eye regions. Fig. 6.5 shows examples of eye detection.

The experiment performed so far detects clusters of candidates. As a consequence, post-processing is warranted, and in our case WTA (Winner-Take-All) is used. The WTA procedure clusters adjacent candidates, finds their corresponding centers, and filters out those that fail pairwise distance constraints, like those holding between the two eyes. Performance is counted as correct when the center of the cluster is within four pixels of the true eye location. The motivation for using the four-pixel distance comes from testing taking place over 24x16 search windows shifted four pixels at a time in order to cover the entire facial image. The results obtained should be assessed with the understanding that: (i) only two candidates, corresponding to the left and right eye, are allowed and misses can occur - in our testing we missed one eye in two out of 58 test images; and (ii) the false positive rate is now evaluated in terms of missing the eyes rather than over frames all across the facial landscape.
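A sketch of such a WTA pass (the clustering radius and the eye-to-eye separation bounds are assumed values for illustration):

```python
import numpy as np

def winner_take_all(hits, cluster_radius=8, min_sep=16, max_sep=48):
    """Cluster adjacent detections, keep each cluster's center, then keep
    the first pair of centers whose separation is consistent with two
    eyes (the separation bounds here are assumed values)."""
    clusters = []
    for r, c in hits:
        for cl in clusters:
            if abs(cl["r"] - r) <= cluster_radius and abs(cl["c"] - c) <= cluster_radius:
                cl["members"].append((r, c))
                break
        else:
            clusters.append({"r": r, "c": c, "members": [(r, c)]})
    centers = [tuple(np.mean(cl["members"], axis=0)) for cl in clusters]
    for i in range(len(centers)):
        for j in range(i + 1, len(centers)):
            d = np.hypot(centers[i][0] - centers[j][0],
                         centers[i][1] - centers[j][1])
            if min_sep <= d <= max_sep:      # pairwise distance constraint
                return centers[i], centers[j]
    return None                              # a miss: no consistent pair
```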
