
gene {0,1} indicates whether or not the ith feature from the overall feature set is included in the specified feature subset. Thus, each individual in a GA population consists of a fixed-length binary string representing some subset of the given feature set. The advantage of this representation is that a standard, well-understood GA can be used without any modification.
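As a minimal illustration of this encoding (a Python sketch; the variable names are ours, not the dissertation's), a feature subset is just a binary mask over the feature indices:

```python
import numpy as np

# Each GA individual is a fixed-length binary string; bit i says whether
# feature i belongs to the candidate subset.
n_features = 105                                   # size of the overall feature set
rng = np.random.default_rng(0)

individual = rng.integers(0, 2, size=n_features)   # e.g. [1 0 1 1 0 ...]
subset = np.flatnonzero(individual)                # indices of selected features

print(f"{individual.sum()} of {n_features} features selected")
```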

Each member of the current GA population represents a competing feature subset that must be evaluated to provide fitness feedback to the evolutionary process. This is achieved by invoking C4.5 with the specified feature subset and a set of training data (reduced to include only the feature values of the specified features). The decision tree is then tested for classification accuracy on a set of unseen evaluation data. Its accuracy together with the size of the feature subset is used as the GA fitness measure.

Fitness functions for feature subsets based on information content, orthogonality, and the like can nevertheless provide baseline feedback to the stochastic search methods characteristic of evolutionary computation (EC), resulting in enhanced performance. As an example, GAs can search the space of all possible subsets of a large set of candidate discrimination features, capture important non-linear interactions, and thus improve on the quality of the feature subsets produced by ranking methods alone. The effectiveness of the selected features on the actual pattern recognition task can then be assessed by learning appropriate classifiers and measuring their observed performance. Learning has the ability to smooth the fitness landscape, thus facilitating evolution, and eventually what is learned becomes clamped ('phenotype rigidity'). Fig. 5.3 shows how the interplay between learning and evolution takes place in the context of pattern recognition.

Feature subsets selected by the GA module for evaluation are used by the filter sub-module ('Feature Mask') to preprocess the training data and describe it only in terms of those features. The filter sub-module also divides the learning examples into training and testing data, respectively. The training data is used by the learning sub-module, while the testing data is used by the evaluation module to assess the (local) fitness of what has been learned so far. As the number of training examples increases, there is more opportunity for learning and an increased potential effect on evolution. Too much learning ('overfitting'), however, can result in loss of generality and negatively affect the overall robustness of the results.
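The evaluation loop just described can be sketched as follows, with scikit-learn's CART learner standing in for C4.5 (the 40/60 split mirrors the experimental setup reported below; the size-penalty weight is an assumed value):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def fitness(individual, X, y, size_penalty=0.01):
    """Evaluate one GA individual: mask the features, induce a tree on the
    training part, score it on held-out data, and trade accuracy off
    against subset size. The penalty weight is an assumed value."""
    mask = individual.astype(bool)
    if not mask.any():                       # empty subsets get worst fitness
        return 0.0
    X_masked = X[:, mask]                    # the 'Feature Mask' filter step
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_masked, y, train_size=0.4, random_state=0)
    tree = DecisionTreeClassifier().fit(X_tr, y_tr)   # stand-in for C4.5
    accuracy = tree.score(X_te, y_te)
    return accuracy - size_penalty * mask.sum() / mask.size
```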

[Figure: the GA Module receives the Full Feature Set and a Fitness Measure and emits candidate Feature Subsets; the Evaluation Module applies a Feature Mask to the examples (obtained from Image Data via Feature Extraction), splits them into Training and Testing Data, performs Classifier Learning and Classifier Evaluation, and returns a FOM (Figure-of-Merit) combining classifier evaluation and Feature Set Size; the output is the Optimal Feature Subset.]

Figure 5.3. Eye Recognition: Symbiotic Adaptation for Pattern Classification Using Learning and Evolution

Our belief is that such a hybrid learning system will identify significantly better feature subsets than those produced by existing methods for two reasons. First, we are exploiting the power of GAs to efficiently explore the non-linear interactions of a given set of features. Second, by using C4.5 in the evaluation loop, we have an efficient mechanism for directly measuring classification accuracy.

In order to test our ideas we have implemented a prototype version of the system. For the GA component, we used without modification GENESIS (Grefenstette, 1991), a standard GA implementation. Similarly, we used without modification C4.5, a standard decision tree implementation (Quinlan, 1986), to build the decision trees for the evaluation procedure. For both components, standard default parameter settings from the literature were used. For the GA module, this resulted in a constant population size of 50, a crossover rate of 0.6, and a mutation rate of 0.001. For C4.5, the pruning confidence level was set to the default of 25%.
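A minimal generational GA with these parameter settings might look as follows (a sketch, not GENESIS itself; the selection and replacement schemes are assumptions):

```python
import numpy as np

POP_SIZE, P_CROSS, P_MUT, N_GENES = 50, 0.6, 0.001, 105
rng = np.random.default_rng(1)

def evolve(fitness_fn, n_generations=20):
    """Minimal generational GA with the quoted parameter settings."""
    pop = rng.integers(0, 2, size=(POP_SIZE, N_GENES))
    for _ in range(n_generations):
        fit = np.array([fitness_fn(ind) for ind in pop])
        probs = fit - fit.min() + 1e-9          # fitness-proportionate
        probs = probs / probs.sum()             # selection (an assumption)
        pop = pop[rng.choice(POP_SIZE, size=POP_SIZE, p=probs)]
        for i in range(0, POP_SIZE - 1, 2):     # one-point crossover
            if rng.random() < P_CROSS:
                cut = rng.integers(1, N_GENES)
                pop[[i, i + 1], cut:] = pop[[i + 1, i], cut:]
        flip = rng.random(pop.shape) < P_MUT    # bit-flip mutation
        pop = np.where(flip, 1 - pop, pop)
    return pop
```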

5.4 Experimental Results

The ability to detect salient facial features is an important component of any face recognition system. Eye detection plays an important role in face normalization and thus facilitates further localization of facial landmarks. It allows one to focus attention on salient facial configurations, to filter out structural noise, and to achieve eventual face recognition. For these purposes, we first applied the hybrid learning system to perform feature selection for facial landmark classification.

In our experiments, 102 eye, 102 nose, and 102 other facial region examples are available for learning, and 52 eye, 52 nose, and 52 other facial region examples are used as test data. Each eye, nose, or other-region image has a resolution of 16 by 12 (columns by rows) pixels and is cut from human face images (64x72 pixels). 105 features are produced for each example. The configuration of these features follows the rules listed in Table 5.2, while Table 5.3 gives the characteristics of the face data.

Features        Description
x1 to x35       Means of a 4x4 window with 50% (2-pixel) overlap in each shift.
x36 to x70      Standard deviations of a 4x4 window.
x71 to x105     Entropies of a 4x4 window.

Table 5.2 Configuration of 105 features

Characteristics       Description
Number of Examples    Learning set of 102 examples and test set of 52 examples.
Number of Features    105 features.
Feature Value         Numerical, in the range 0 to 255.
Number of Classes     3 decision classes, 1 to 3 (eye, nose, and others).

Table 5.3 Characteristics of the Facial Data
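The window arithmetic can be checked directly: a 16x12 patch scanned by a 4x4 window with a 2-pixel stride (50% overlap) yields 7x5 = 35 window positions, and three statistics per window give the 105 features. A sketch of this extraction (the 256-bin histogram estimate of entropy is our assumption):

```python
import numpy as np

def window_features(patch):
    """patch: 12x16 (rows x cols) region with pixel values in 0..255.
    Returns 105 features: 35 means, 35 standard deviations, 35 entropies,
    one triple per 4x4 window shifted by 2 pixels (50% overlap)."""
    means, stds, ents = [], [], []
    for r in range(0, patch.shape[0] - 3, 2):        # 5 row positions
        for c in range(0, patch.shape[1] - 3, 2):    # 7 column positions
            w = patch[r:r+4, c:c+4].astype(float)
            means.append(w.mean())
            stds.append(w.std())
            p = np.bincount(w.astype(int).ravel(), minlength=256) / w.size
            p = p[p > 0]                              # drop empty bins
            ents.append(-(p * np.log2(p)).sum())      # Shannon entropy
    return np.array(means + stds + ents)              # length 105

patch = np.random.default_rng(2).integers(0, 256, size=(12, 16))
assert window_features(patch).shape == (105,)
```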

An initial set of experiments has been performed to assess the performance of the hybrid GA-DT learning system. Subsets of optimal features for recognizing visual concepts in facial image data have been learned and compared against standard methods for feature selection. The error rates on unseen image data and the tree description complexity (measured as the number of nodes) have been used as the basis for comparison.

In order to apply cross validation to the learning process, the learning data was randomly shuffled to generate two sets (Fig. 5.4). In each set, 40% of the examples were used for inducing decision trees and the other 60% for evaluating the learned description. Ten experiments were performed on each set in order to find the best average performance. The set that produced the best result, i.e., the lowest error rate, was selected as the training set. The tree generated during the best run on that set (one of ten runs) was then applied to the unseen test data.
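In outline, this protocol amounts to the following sketch (run_ga_dt is an assumed stand-in for one GA-DT run that returns an error rate on the evaluation indices):

```python
import numpy as np

def select_training_set(n_examples, run_ga_dt, n_runs=10, seed=0):
    """Sketch of the Fig. 5.4 protocol: shuffle, split into two sets,
    divide each 40/60 into training/evaluation, run GA-DT ten times per
    set with different seeds, keep the set with the lowest average error
    and remember its best run."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_examples)          # random shuffle
    best = None
    for half in np.array_split(order, 2):        # the two data sets
        cut = int(0.4 * len(half))               # 40% train / 60% evaluate
        train, evaluate = half[:cut], half[cut:]
        errors = [run_ga_dt(train, evaluate, seed=s) for s in range(n_runs)]
        if best is None or np.mean(errors) < best[0]:
            best = (float(np.mean(errors)), train, int(np.argmin(errors)))
    return best   # (average error, training indices, index of best run)
```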

Results obtained by the GA-DT system have been compared with two other sets of results. The first was obtained by using all the features (105 for the facial data). To generate the second, a feature set reduced to the same number as the one produced by the GA-DT experiment was used. This reduction was achieved by independently ranking each feature using an information-theoretic entropy measure (infomax) to estimate which features are the most discriminatory. Features that lead to the greatest reduction in the estimated entropy measure over the training examples are chosen. The exact criterion is to choose the feature vector X with values {xv1, xv2, ..., xvm} that minimizes an expression of the following (class-entropy) form over the n classes,

E(X) = \sum_{j=1}^{n} \sum_{i=1}^{m} \left[ -x_i^{+} \log_2 \frac{x_i^{+}}{x_i^{+} + x_i^{-}} - x_i^{-} \log_2 \frac{x_i^{-}}{x_i^{+} + x_i^{-}} \right]

where, for a given class, x_i^+ is the number of examples in that class with value xvi, and x_i^- is the number of negative examples (all the examples not belonging to this class) with the value xvi.
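A hedged sketch of this ranking criterion, matching the expression above (the function name is ours):

```python
import numpy as np

def infomax_score(feature_col, labels):
    """For each value of the feature, count positive (in-class) and
    negative (out-of-class) examples and accumulate the class entropy
    over all classes; lower scores mark more discriminatory features."""
    score = 0.0
    values = np.unique(feature_col)
    for cls in np.unique(labels):
        pos_mask = labels == cls
        for v in values:
            x_pos = np.sum(pos_mask & (feature_col == v))   # x_i^+
            x_neg = np.sum(~pos_mask & (feature_col == v))  # x_i^-
            for x in (x_pos, x_neg):
                if x > 0:
                    score -= x * np.log2(x / (x_pos + x_neg))
    return score

# Rank features ascending by score and keep the best k (here k = 41):
# ranking = np.argsort([infomax_score(X[:, j], y) for j in range(X.shape[1])])[:41]
```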

[Figure: the Learning Data is randomly split into two data sets; each is divided into 40% Training Data and 60% Evaluation Data; 10 runs of GA-ID3 with different seeds for the random generator are performed on each set; the best data set is selected based on its average performance over the 10 runs, the best run for that set is selected, and the resulting Final Tree is applied to the Unseen Testing Data to produce the Results.]

Figure 5.4: Setup for GA-DT Experiments

Fig 5.5 shows average performance on the evaluation sets over 10 runs using the GA-DT system. Table 5.4 shows the results of the GA-DT experiment together with the corresponding performance for (i) the set with all the features and (ii) the set with features reduced to the number obtained in the GA-DT experiment and ranked using the (entropy) infomax measure. The results show that an improvement in recognition rate (a lower error rate) was obtained for the feature sets reduced by the GA-DT system; the number of features was reduced by about 60% for the facial data.

[Plot: average error rate (%) on the evaluation sets versus number of trials (50 to 950) for the 1st and 2nd data sets; the two curves end near 31.371% and 29.109%.]

Figure 5.5: Average Error Rate for 10 Runs of Face Data Sets 1, 2

Feature Set                                  Error Rate   Tree Complexity
All 105 features                             38.4%        8 nodes
41 features selected by the GA-DT system     27.5%        13 nodes
41 best features chosen by feature ranking   38.5%        11 nodes

Table 5.4 Experimental Results for Face Data


CHAPTER 6

Eye Detection

The ability to detect salient facial features is an important component of any face recognition system, in particular for normalization purposes and for the extraction of the image features to be used later on by the face classifiers. The detection of facial landmarks underlies attention mechanisms similar to those used by the human visual system (HVS) to screen the visual field and to focus attention on salient input characteristics. Among the many facial features available, it appears that the eyes play the most important role in automated visual interpretation and human face recognition. As both the position of the eyes and the interocular distance are relatively constant for most people, detecting the eyes first of all plays an important role in face normalization and thus facilitates further localization of facial landmarks. It is eye detection that allows one to focus attention on salient facial configurations, to filter out structural noise, and to achieve eventual face recognition.

6.1 Related Work

Yuille (1991) describes a complex but generic strategy, characteristic of the abstractive approach, for locating facial landmarks such as the eyes using the concept of a deformable template. The template is a parametrized geometric model of the face or part of it (mouth and/or eyes) to be recognized, together with a measure of how well it fits the image data, where variations in the parameters correspond to allowable deformations. Lam and Yan (1996) extended Yuille's method for extracting eye features by using corner locations inside the eye windows, which are obtained by means of average anthropometric measures after the head boundary is first located. Deformable templates and the closely related elastic snake models are interesting concepts, but using them can be difficult in terms of learning and implementation, not to mention their computational complexity. Eye features can also be detected using their symmetry characteristics. As an example, Yeshurun and his colleagues (1991) developed a generalized symmetry operator for eye detection. As another example, Ducottet et al. (1994) detect the position of objects with circular symmetry and variable size against a noisy background, using the orthonormal wavelet transform of an image and then the correlation product of the image with an adapted representation of the object. Most recently, an attempt has been made to speed up the optimization process involved in symmetry detection using evolutionary computation: Gofman et al. (1996) introduced a global optimization approach for detecting local reflectional symmetry in gray-level images using 2D Gabor decomposition via soft windowing in the frequency domain. Samaria (1994) employs stochastic modeling, using Hidden Markov Models (HMMs), to holistically encode face information. When frontal facial images are sampled using top-to-bottom scanning, the natural order in which the facial landmarks appear is encoded using an HMM. The HMM leads to efficient detection of eye strips but still leaves the task of homing in on the eyes unsolved.

A similar behavior-based approach for pattern classification and navigation tasks is suggested by the concept of visual routines (Ullman, 1984), recently referred to as a visual routine processor (VRP) by Horswill (1995). The VRP assumes the existence of a set of visual routines that can be applied to base image representations (maps), subject to specific functionalities, and driven by the task at hand. Recently, Johnson, Maes and Darrell (1994) suggested using Genetic Programming to evolve visual routines whose perceptual task is finding the left and right hands of a person in binary images.

Most of the eye detection methods available now lack size invariance, as they assume a prespecified face size or must operate at several grid sizes and perform an exhaustive search over the face images. Our novel strategy overcomes this drawback and achieves invariance through the evolution of 'navigation' routines, traffic collection, and consensus methods. We consider eye detection as a face recognition (routine) competency whose inner workings include screening the facial landscape to detect ('pop out') salient locations as possible eye locations, and labeling as eyes only those salient areas whose feature distribution 'fits' what the eye classifier has been trained to recognize. In other words, the first stage of the eye detection routine implements attention mechanisms whose goal is to create an 'eye' saliency map, while the second stage processes the facial landscape filtered by the just-created saliency map for actual eye recognition. Due to the complexity involved in eye detection, we introduce a novel strategy that expands on the abstractive approach using evolutionary techniques.

6.2 Eye Detection Architecture

The overall architecture, shown in Fig. 6.1, consists of two components whose tasks are eye saliency and eye recognition, respectively. The ('sensory-reactive') saliency component has to discover the most likely eye locations, while the ('model-based') eye recognition component probes the suggested locations ('candidates') for actual eye detection.

[Diagram: the Sensory & Reactive Level produces the Saliency Map ('Where'); the Model-Based Level performs Eye Recognition ('What').]

Figure 6.1. Eye Location Using a Two-Stage Where and What Architecture

This section describes in detail how the tasks of feature extraction (and data compression), evolution of visual routines for conspicuity (map) derivation, and the integration of the output from the visual routines for defining the saliency map, characteristic of collective behavior, are mapped into computational models. The goals of the novel architecture for eye detection are twofold: (i) derivation of the saliency attention map using consensus between navigation routines encoded as finite state automata (FSA) evolved using GAs, (ii) selection of optimal features for eye classification using GAs and induction of DT (decision trees) for classifying the most salient locations identified earlier as eye vs non-eye regions. Specifically, we describe what the image base representations are, how visual routines are encoded as FSA and evolved, and how consensus methods can integrate ('fuse') visual routine outputs during testing on unseen facial images.

6.2.1 Derivation of the Saliency Map

The derivation of the saliency map (see Fig. 6.2) is described in terms of the tasks involved, the computational models, and the corresponding functionalities. The tasks involved include feature extraction and data compression, conspicuity derivation, and integration of the outputs from several visual routines. The corresponding computational model has access to feature maps, while GAs are used to generate finite state automata (FSA) as appropriate policy (action) functions. Consensus methods achieve the integration of the visual routines, characteristic of distributed multiagent collective behavior, as the individual routines ('agents') navigate across the facial landscape, trying to filter out non-eye regions and home in on the most promising ('salient') eye locations.
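The FSA encoding is not spelled out at this point in the text, but schematically a navigation routine reduces to lookup tables that a GA can evolve; everything in the following sketch (table sizes, move alphabet, function names) is an illustrative assumption:

```python
import numpy as np

# Schematic FSA policy: the genome evolved by the GA would encode the
# transition and action tables; an observation is a discretized reading
# of the local feature map. All sizes below are assumed for illustration.
N_STATES, N_OBS, N_MOVES = 8, 4, 4          # assumed sizes

rng = np.random.default_rng(3)
transition = rng.integers(0, N_STATES, size=(N_STATES, N_OBS))
action = rng.integers(0, N_MOVES, size=(N_STATES, N_OBS))  # 0..3 = N/E/S/W

def fsa_step(state, observation):
    """One navigation step: read the local observation, emit a move,
    and change state."""
    return transition[state, observation], action[state, observation]

state = 0
for obs in [1, 3, 0, 2]:                     # a toy observation stream
    state, move = fsa_step(state, obs)
```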

[Diagram: tasks involved, computational models, and functionalities - feature extraction (statistics: mean, s.d., and entropy; navigation), conspicuity derivation (FSA and GA; conspicuity maps), and data integration (consensus; saliency map).]

Figure 6.2. Architecture of the Where Stage - the Saliency Map

6.2.2 Feature Selection and Eye Classification

The eye detection architecture for the derivation of base representations ('feature selection') and eye classification integrates genetic algorithms (GAs) and inductive generalization, characteristic of learning from examples, as described in Sect. 5.3. Fig. 5.3 illustrates our general approach. An initial but large list of low-level image features is provided, together with the images from which the feature vectors ('examples') for both positive (eye) and negative (non-eye) classes are extracted. The GA is used to explore the space of all subsets of the given feature list, where preference is given to those feature subsets which achieve the best detection performance using small-dimensionality feature sets ('Occam's razor'). Each of the selected feature subsets is evaluated - its fitness measured - using the induction of decision trees. The above process is iterated along evolutionary lines, and the best feature subset found is then recommended for use in the actual design of the corresponding facial landmark detection system. Note that the examples referred to in Fig. 5.3 are represented by all the features from the original list, while the training and tuning data consist of only those features passed through the feature subset controller as a result of natural selection.

The learning strategy, conceptually akin to cross-validation (CV) and bootstrapping, draws randomly from the original list of examples, both positive and negative, and generates two labeled sets consisting of training and tuning data. The training data is used to induce trees, while the tuning data is used to evaluate the trees and generate appropriate fitness measures. Once the evolution stops, the best tree is frozen and used for testing on unseen facial imagery to assess how well the visual routine performs on eye detection.

The decision tree (DT) induction process is applied to the training data and generates decision trees for each of the given classes. The evaluation procedure assesses the fitness of each feature subset, as given by its corresponding induced tree, using the tuning data sets and employing an integrated measure consisting of both the number of features used and the error rate. This measure of fitness is then passed to the Genetic Module stage to further evolve optimal candidate feature subsets. The whole process is initialized by randomly selecting subsets of features and making them available to DT induction. Note that detection of an eye is considered successful if the corresponding region is reached. If one is interested in detecting only the center of the landmark region, penalty factors for (i) multiple detections and (ii) the distance between the center of the landmark and the actual detection should also be included.

A description generated by a decision tree (DT) induction process is both descriptive and procedural. The descriptive representation is used to evaluate a given subset of features by measuring the error rate on the tuning data. The procedural representation describes the way in which a given subset of features should be applied (i.e., it describes not only the feature subset itself but also the order in which the features should be applied). Since the DT learning process determines the most discriminatory feature at each step of the recursive process of dichotomizing the training data, the procedural representation is inherently embedded in the decision tree. This procedural descriptiveness of the evaluation process is the main motivation for using decision trees. To implement the Genetic Module and the DT we use GENESIS (GENEtic Search Implementation System) and C4.5, respectively.
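To make the descriptive/procedural distinction concrete, the following sketch uses scikit-learn's CART learner as a stand-in for C4.5; printing the tree exposes the order in which the selected features are applied, root first:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2).fit(X, y)

# Descriptive use: the tree is scored as a classifier (here, for brevity,
# on its own training data; the dissertation scores held-out tuning data).
print("accuracy:", tree.score(X, y))

# Procedural use: the printed tree shows not only which features were
# selected but also the order in which they are applied, root first.
print(export_text(tree, feature_names=["sl", "sw", "pl", "pw"]))
```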

6.3 Experiments

This section describes in detail how the tasks of feature extraction (and data compression), evolution of visual routines for conspicuity (map) derivation, and the integration of the output from the visual routines for defining the saliency map, characteristic of collective behavior, are mapped into computational models. Two experiments were implemented to achieve the eye detection task. In the first experiment we use exhaustive search over the entire face image to find the eye location after the optimal eye feature subset has been found. The second experiment uses a new strategy that overcomes the drawback of exhaustive search and achieves invariance through the evolution of 'navigation' routines. Its goals are twofold: (i) derivation of the saliency attention map using consensus between navigation routines encoded as finite state automata (FSA) evolved using GAs, (ii) selection of optimal features for eye classification using GAs and induction of DT (decision trees) for classifying the most salient locations identified earlier as eye vs non-eye regions. Specifically, we describe what the image base representations are, how visual routines are encoded as FSA and evolved, and how consensus methods can integrate ('fuse') visual routine outputs during testing on unseen facial images. Experimental results, using 35 face images from the FERET database, are provided in a stepwise fashion as we describe our particular implementation and show the feasibility of our hybrid approach. Since large amounts of FERET images were acquired during different photo sessions, the lighting conditions and the size of the facial images can vary. The diversity of the FERET database spans gender, race, and age.

6.3.1 Eye Detection Using Exhaustive Search


This section describes the data used, the original list of possible features, and the detection architectures used to train and test the eye detection routine. The facial images are of 64x64 resolution - Fig. 6.3 shows examples of the facial images used in our experiments. The database consists of 120(+) eye examples. As training involves negative examples as well, the database also includes 480(-) non-eye examples drawn randomly across the facial landscape. The resolution for both the positive and negative eye examples is 24x16. The feature list for each 24x16 window initially consists of one hundred and forty-seven features {x1, x2, ..., x147}. The set of features, each of them measured over 6x4 windows with two pixels of overlap, is:

-x1 to x49: means for each window

-x50 to x98: entropies for each window

-x99 to x147: means for each window after applying the Laplacian over the 24x16 frames

Fig. 6.4 illustrates the feature extraction process for training and tuning examples.
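For the counts to come out, a 24x16 patch scanned by 6x4 windows must yield 7x7 = 49 positions per statistic (3 x 49 = 147), which implies a vertical shift of 2 pixels and a horizontal shift of 3; the sketch below (an illustration, not the dissertation's code) adopts those strides:

```python
import numpy as np
from scipy.ndimage import laplace

def eye_features(patch):
    """patch: 16x24 (rows x cols) region with values in 0..255.
    Returns 147 features: per 6x4 window, its mean, its entropy, and its
    mean after a Laplacian filter. Strides (2 rows, 3 cols) are chosen
    to yield a 7x7 grid of windows - an assumption."""
    lap = laplace(patch.astype(float))
    means, ents, lap_means = [], [], []
    for r in range(0, patch.shape[0] - 3, 2):        # 7 row positions
        for c in range(0, patch.shape[1] - 5, 3):    # 7 column positions
            w = patch[r:r+4, c:c+6]
            means.append(w.mean())
            p = np.bincount(w.astype(int).ravel(), minlength=256) / w.size
            p = p[p > 0]
            ents.append(-(p * np.log2(p)).sum())
            lap_means.append(lap[r:r+4, c:c+6].mean())
    return np.array(means + ents + lap_means)         # length 147

patch = np.random.default_rng(4).integers(0, 256, size=(16, 24))
assert eye_features(patch).shape == (147,)
```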

Once training is completed, exhaustive testing ('search') of the learned detection routine is performed across 58 unseen facial images. Testing takes place as 24x16 search windows are shifted by four pixels to cover the entire facial image. Each test image thus generates 143 examples that the eye detection routine is asked to decide on. The detection strategy includes cross validation (CV) and winner-take-all (WTA).
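The figure of 143 windows follows directly: with a 4-pixel shift, a 64x64 image admits (64-16)/4+1 = 13 vertical and (64-24)/4+1 = 11 horizontal positions for a 24x16 window, i.e. 13 x 11 = 143. A sketch of the scan (classify is an assumed wrapper around the induced tree; eye_features is the sketch above):

```python
def scan_image(image, classify):
    """Slide a 24x16 (cols x rows) window over a 64x64 image in 4-pixel
    steps and collect the locations the classifier labels as an eye."""
    hits = []
    for r in range(0, image.shape[0] - 16 + 1, 4):      # 13 row positions
        for c in range(0, image.shape[1] - 24 + 1, 4):  # 11 col positions
            window = image[r:r+16, c:c+24]
            if classify(eye_features(window)) == "eye":
                hits.append((r, c))
    return hits        # 143 windows examined per image
```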

Figure 6.3 Examples of Face Images

[Diagram: from a training facial image (64x64 pixels), 24x16-pixel windows are scanned with 6x4 sub-windows shifted a few pixels at a time; the mean, the entropy, and the mean after a Laplacian are computed per sub-window to generate the 147 features {x1, x2, ..., x147}.]

Figure 6.4. Feature Extraction for Decision Trees

The experiment was carried out to assess the feasibility of our hybrid genetic approach for eye detection, using the general architecture described in Sect. 6.2. Each generation keeps a constant population size of fifty individuals; the crossover rate and mutation rate are 0.6 and 0.001, respectively.

The set of six hundred examples, 120(+) eye examples and 480(-) non-eye examples, is divided into three equal subsets for cross-validation training and tuning, and a tournament consisting of three sets of CV rounds takes place. The corresponding error rates on the tuning data, including both false positives - false detections of eyes - and false negatives - missed eyes, are shown as fitness measures in Fig. 6.4. The feature subset corresponding to the tree derived from the third CV round, which achieved the smallest error rate - 4.87% - consists of only 60 of the original 147 features. This feature subset is used to evaluate the performance of eye detection on unseen test examples.

[Plot: error rate (%) on the tuning data versus number of trials (0 to 1000) for the 1st, 2nd, and 3rd CV rounds; the curves end at 6.26%, 4.90%, and 4.87%.]

Figure 6.4. Fitness Measures for Learning on Database.

Table 6.1 gives the average confusion matrix for the eye detection task. These statistics were obtained by visually inspecting the detection of eye regions. Fig. 6.5 shows examples of eye detection.

The experiment performed so far detects clusters of candidates. As a consequence, post-processing is warranted, and in our case WTA (Winner-Take-All) is used. The WTA procedure clusters adjacent candidates, finds their corresponding centers, and filters out those that fail pairwise distance constraints, like those holding between the two eyes. Performance is counted as correct when the center of the cluster is within four pixels of the true eye location. The motivation for using the four-pixel distance comes from testing taking place over 24x16 search windows shifted four pixels at a time in order to cover the entire facial image. The results obtained should be assessed with the understanding that: (i) only two candidates, corresponding to the left and right eye, are allowed and misses can occur - in our testing we missed one eye in two out of 58 test images; and (ii) the false positive rate is now evaluated in terms of missing the eyes rather than over frames all across the facial landscape.
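A sketch of such a WTA pass (the clustering radius and the eye-to-eye separation bounds are assumed values for illustration):

```python
import numpy as np

def winner_take_all(hits, cluster_radius=8, min_sep=16, max_sep=48):
    """Cluster adjacent detections, keep each cluster's center, then keep
    the first pair of centers whose separation is consistent with two
    eyes (the separation bounds here are assumed values)."""
    clusters = []
    for r, c in hits:
        for cl in clusters:
            if abs(cl["r"] - r) <= cluster_radius and abs(cl["c"] - c) <= cluster_radius:
                cl["members"].append((r, c))
                break
        else:
            clusters.append({"r": r, "c": c, "members": [(r, c)]})
    centers = [tuple(np.mean(cl["members"], axis=0)) for cl in clusters]
    for i in range(len(centers)):
        for j in range(i + 1, len(centers)):
            d = np.hypot(centers[i][0] - centers[j][0],
                         centers[i][1] - centers[j][1])
            if min_sep <= d <= max_sep:      # pairwise distance constraint
                return centers[i], centers[j]
    return None                              # a miss: no consistent pair
```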
