Detection Strategies for Face Recognition Using Learning and Evolution, PhD Dissertation, 1998

The feasibility of our AVBPA architecture has been tested on three FERET video sequences acquired under different lighting conditions. The first sequence (one subject) was taken indoors, the second sequence (two subjects) was taken outdoors, and the third (outdoor) sequence was taken during stormy conditions, so it displays a low signal-to-noise ratio. The three sequences are shown in Fig. 4.9.

The goal for each of the three sequences is to detect the moving human subject, locate its face, and verify its identity (ID), i.e., whether he/she ('probe') belongs to the given database (DB) of subjects ('gallery'). The generic procedure involves (a) video skim (using a reduced sampling rate), (b or b′) subject detection (using the difference or optical flow method, respectively), (c) face detection (face location, face refinement, and face normalization), and (d) authentication (using RBF). The (a) through (d) indicators correspond to the processing points shown in Fig. 4.10. The three video sequences were acquired at a rate of 30 frames/sec. Video skim was achieved by subsampling the sequences at a frame rate of 6 frames/sec, yielding 65, 37, and 22 frames, respectively.
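The skim step is plain temporal subsampling. A minimal sketch (the function name and fps arguments are illustrative, not from the text):

```python
def video_skim(frames, src_fps=30, skim_fps=6):
    """Keep every (src_fps // skim_fps)-th frame, e.g. every 5th frame
    when subsampling a 30 frames/sec sequence down to 6 frames/sec."""
    step = src_fps // skim_fps
    return frames[::step]
```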

[Figure: sequence 1 (frames 0, 30, 175, 200, 250, 275, 300, 325), sequence 2 (frames 0, 15, 65, 120, 150, 160, 170, 185), and sequence 3 (frames 0, 25, 45, 70, 85, 90, 100, 110)]

Figure 4.9 Video Sequences

Moving Subject Detection

Two methods have been implemented for tracking and detecting human subjects, and their activation is decided based upon the noise level. The first method, that of differences, is cheaper, but works only on video sequences displaying a relatively high signal-to-noise ratio. The second method, that of optical flow, becomes necessary when the signal-to-noise ratio falls below a predetermined threshold. As the difference method implements a straightforward pairwise difference between two consecutive video frames, we briefly describe next only the optical flow method.
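The difference method itself reduces to thresholding the absolute pixel-wise difference of consecutive frames. A minimal numpy sketch (the threshold and the toy frames are assumptions, not values from the experiments):

```python
import numpy as np

def difference_map(frame_a, frame_b, threshold=25):
    """Flag pixels whose intensity changes by more than `threshold`
    between two consecutive frames as belonging to a moving subject."""
    diff = np.abs(frame_a.astype(np.int16) - frame_b.astype(np.int16))
    return diff > threshold

# Toy example: a bright 4x4 "subject" moves two pixels to the right.
prev_frame = np.zeros((32, 32), dtype=np.uint8)
next_frame = np.zeros((32, 32), dtype=np.uint8)
prev_frame[10:14, 10:14] = 200
next_frame[10:14, 12:16] = 200
mask = difference_map(prev_frame, next_frame)
```

The mask fires only where the subject entered or left a pixel, which is exactly why the method degrades when background clutter (e.g., moving trees) also changes between frames.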

Image motion results from the projection of an object's 3-D motion onto a 2-D image plane. The so-called optical flow (image flow) is the apparent motion of an image pattern in the image plane, and it corresponds to a velocity field. It is well known that people can easily perceive and track an object's motion. The perception of visual motion involves two types of computations: those of temporal changes and spatial integration. The well-known intensity gradient model proposed by Horn and Schunck (1980) applies differential operations to both the spatial and temporal dimensions. Since natural images are not always differentiable, the intensity gradient model usually requires pre-smoothing of the images. Therefore, the intensity gradient model includes a spatial smoothing filter


followed by time differentiation. Assuming that velocities vary smoothly everywhere, Horn and Schunck use a global smoothness constraint to resolve the optical flow constraint equation

Ex u + Ey v + Et = 0

(4.2)

This equation is called the optical flow constraint equation, where the spatial and temporal derivatives Ex, Ey, and Et are estimated from the video sequence. As the optical flow constraint equation is overdetermined, it can be solved in the least-squared-error sense using the global smoothness assumptions mentioned above by minimizing the error function

Error(u, v) = ||Ex u + Ey v + Et||^2

(4.3)

For the purpose of moving object detection, Tsao and Chen (1993) have suggested the following approach. To detect (smaller) moving objects, rather than computing the optical flow explicitly, it is better to consider just the error map Error(u, v) of the optical flow. In the presence of moving objects, the image is usually not differentiable around the locations of those objects. As a consequence, the intensity gradient method may generate larger optical flow errors around the locations of the moving objects. This error map can then serve as an indicator of moving objects. The optical flow method takes advantage of the error (energy) of the optical flow. By thresholding at a certain energy level, the errors due to clutter and noise can be mostly removed and only the large errors due to the moving objects remain. This method works very well at very low signal-to-noise/clutter ratios, where the difference method would fail on similar video sequences.
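The error-map idea can be sketched by fitting (u, v) over a small window in the least squares sense and keeping the residual energy of the constraint; this is a simplified stand-in for the Tsao and Chen formulation, with the window size and the finite-difference gradient estimates as assumptions:

```python
import numpy as np

def flow_error_map(f0, f1, win=5):
    """Residual energy of Ex*u + Ey*v + Et = 0, fitted per window.

    Around moving objects the brightness-constancy model breaks down,
    so the least squares residual (the 'error map') is large there and
    near zero over the static background."""
    f0 = f0.astype(float); f1 = f1.astype(float)
    Ex = np.gradient(f0, axis=1)
    Ey = np.gradient(f0, axis=0)
    Et = f1 - f0
    H, W = f0.shape
    r = win // 2
    err = np.zeros((H, W))
    for y in range(r, H - r):
        for x in range(r, W - r):
            A = np.stack([Ex[y-r:y+r+1, x-r:x+r+1].ravel(),
                          Ey[y-r:y+r+1, x-r:x+r+1].ravel()], axis=1)
            b = -Et[y-r:y+r+1, x-r:x+r+1].ravel()
            uv = np.linalg.lstsq(A, b, rcond=None)[0]
            err[y, x] = float(np.sum((A @ uv - b) ** 2))
    return err

# Toy example: the same 4x4 "subject" translating two pixels.
f0 = np.zeros((32, 32)); f1 = np.zeros((32, 32))
f0[10:14, 10:14] = 200.0
f1[10:14, 12:16] = 200.0
error_map = flow_error_map(f0, f1)
```

Thresholding `error_map` at a fixed energy level then yields the moving-object indicator described above.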

"! $#%!& ' .()*+', -

The corresponding outputs of subject detection using the difference (b) or optical flow (b′) methods for the three sequences are shown in Fig. 4.3 and Fig. 4.4, respectively. As can be seen from Fig. 4.3, the difference method fails to detect the subject in the third sequence. This should come as no surprise, because the trees in the background are no longer stationary, as was the case in the second sequence. The preprocessing stage actually assessed the signal-to-noise ratio of this sequence as low, and as a consequence the optical flow method, rather than the difference method, would be successfully used. Note also that while all the outputs (b′) of the three sequences were successfully processed using the optical flow method, the extra costs involved in using optical flow can be justified only for those cases when the signal-to-noise ratio is low and the difference method would thus fail.

Once a frame corresponding to a moving subject has been detected, that frame is processed to locate the face and surround it with a limited-size box (Huang, 1996). First, one attempts to find a rough approximation for the possible location of the face box. This is achieved using horizontal and vertical projections along the x and y axes, respectively. The resulting profiles are searched for their maximum and minimum values using model-based ('face') constraints. The horizontal projection profile is searched for its extrema by first detecting all its local extrema (min and max) points, and then measuring the distance d between each min and the following max peak. The top side of the boundary box is placed at the max peak location yielding the largest distance d. A similar procedure is used to analyze the vertical projection and yields the left and right sides of an approximate but still tentative surrounding box. The box is finally completed on its bottom side such that the ratio between the vertical and horizontal dimensions of the box is 1.5 (BOX1). This procedure can process faces of different sizes.
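The projection analysis can be sketched as follows; this simplified version replaces the min/max peak-distance search with plain thresholding of the profiles, and the `frac` parameter is an assumption:

```python
import numpy as np

def face_box(image, frac=0.5):
    """Rough face box (BOX1) from horizontal/vertical projections.

    Rows/columns whose projection exceeds `frac` of the profile maximum
    delimit the top, left, and right sides; the bottom is then set so
    that height / width = 1.5, as in the text."""
    horiz = image.sum(axis=1)               # one value per row
    vert = image.sum(axis=0)                # one value per column
    rows = np.where(horiz > frac * horiz.max())[0]
    cols = np.where(vert > frac * vert.max())[0]
    top, left, right = rows[0], cols[0], cols[-1]
    width = right - left + 1
    bottom = top + int(1.5 * width) - 1     # enforce the 1.5 aspect ratio
    return int(top), int(bottom), int(left), int(right)

img = np.zeros((60, 40))
img[20:38, 12:24] = 1.0                     # a toy 18x12 "head" region
box = face_box(img)
```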


More precise face location now takes place within BOX1 using decision trees (DT). The DTs are learned ('induced') from both positive ('face') and negative ('non-face') examples expressed in terms of extracted features such as entropy, mean, and standard deviation, yielding thirty (30 = 5x3x2) feature values. The data set used to train the DT on the face detection task comes from twelve images whose resolution is 256x384 and consists of 2759 (CORRECT / '+') examples and 15673 (INCORRECT / '-') examples. The DTs are run across 8x8 windows from BOX1. As a result of this procedure, each 8x8 window from BOX1 is now labeled as face or non-face.
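The per-window measurements can be sketched as below; the histogram bin count is an assumption, and the text does not specify how the 30 = 5x3x2 values are laid out across sub-regions and scales:

```python
import numpy as np

def window_features(w):
    """Entropy, mean, and standard deviation of one image window,
    the three feature types fed to the decision tree."""
    hist, _ = np.histogram(w, bins=16, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]
    entropy = float(-(p * np.log2(p)).sum())
    return entropy, float(w.mean()), float(w.std())
```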

The output of the face location stage consists of labeled 8x8 (face vs. non-face) windows. Horizontal and vertical profiles (projection analysis) now count the number of face labels and are thresholded to yield the extent of the face image, if present at all, and provide a new and refined approximation for the face as BOX2. Small counts in both the horizontal and vertical profiles indicate the absence of a face. Normalization takes BOX2, if found at all, and produces a standard-size BOX3, available for authentication.

[Figure: difference maps for sequence 1 (frames 30, 175, 200, 250, 275, 300, 325), sequence 2 (frames 15, 65, 120, 150, 160, 170, 185), and sequence 3 (frames 5, 45, 70, 85, 90, 100, 110)]

Figure 4.3 Detection of the Moving Subject using the Difference Method

[Figure: optical flow error maps for sequence 1 (frames 15, 30, 175, 200, 250, 275, 300, 325), sequence 2 (frames 5, 15, 50, 80, 120, 140, 160, 185), and sequence 3 (frames 5, 25, 45, 70, 85, 90, 100, 110)]

Figure 4.4 Detection of the Moving Subject using Optical Flow


The full stepwise (b - d) results obtained for the three sequences are displayed next. Positive and negative authentication of the probe is indicated using + and - signs (Fig. 4.5), respectively. The probe for the first sequence belongs to the gallery; for the second sequence, the left person belongs to the gallery and the right person does not; for the third sequence the probe belongs to the gallery. The corresponding FERET (still) gallery consists of a total of 20 images. Note that the stepwise decisions are marked (using an arrow) at the point where a robust decision can first be made regarding detection, location, and authentication, respectively. All authentication decisions made by RBF were correct and achieved with high confidence.

 

[Figure: stepwise (b, c, d) outputs and +/- authentication decisions overlaid on the frames of the three sequences]

Figure 4.5 Positive and Negative (ID) Authentication of Subjects


CHAPTER 5

Feature Selection Using Learning and Evolution

Pattern recognition, a difficult but fundamental task for intelligent systems, depends heavily on the particular choice of the features used by the classifier. One usually starts with extracting the most representative features and then attempts to derive an optimal subset of features leading to high classification performance, a process known as feature selection.

We herein first address the problem of eye classification. The hybrid approach to eye classification involves first deriving optimal ('reconstructed') candidate windows in terms of (best-basis) wavelet packets, followed by their classification (as eye vs. non-eye) using the Radial Basis Function (RBF) method. It is shown that an optimal choice of a subset of wavelet bases provides for improved RBF performance on the eye detection task.

We then describe a hybrid learning approach for optimal feature selection and the derivation of robust pattern classifiers. Our novel approach, which includes a genetic algorithm (GA) and a tree induction system (ID3), minimizes the number of features used for classification while simultaneously achieving improved classification rates. A GA is used to search the space of all possible subsets of a large set of candidate discrimination features. For a given feature subset, ID3 is invoked to produce a decision tree (DT). The classification performance of the decision tree on unseen data is used as a measure of fitness for the given feature set, which, in turn, is used by the GA to evolve better feature sets. This GA-ID3 process iterates until a feature subset is found with satisfactory classification performance. Experimental results are presented which illustrate the feasibility of our approach on difficult problems involving recognizing visual concepts in facial image data. The results also show improved classification performance and reduced description complexity when compared against standard methods for feature selection.

At the end of this chapter, the work expands on the approach taken by Johnson, Maes, and Darrell (1994) for evolving visual routines. We use the GA-DT architecture mentioned above to discover the optimal base feature representations for eye detection. The experimental results reported demonstrate the feasibility of our approach in terms of feature selection ('data compression') and the corresponding eye detection ('pattern recognition').

Wavelet Representations

The first problem one has to address in the context of developing visual routines is what form the base representations ('features') that the visual routines will be called to operate on should assume. The wavelet representation provides for multiresolution analysis through the orthogonal decomposition of a function along basis functions consisting of appropriate translations and dilations of the mother wavelet function. Continuous wavelets, defined using a pair of functions φ (the scaling function) and ψ (the 'mother' wavelet function), satisfy


φ_{a,b}(x) = |a|^{1/2} φ(ax − b)

(5.1)

ψ_{a,b}(x) = |a|^{1/2} ψ(ax − b)

where a, b are real numbers, and φ_{a,b} and ψ_{a,b} correspond to the scaling and mother wavelet functions dilated by a and translated by b. The discrete wavelet transform (DWT) is obtained for the choice a = a0^m, b = n b0, where m, n are integers, with the scaling and mother wavelet functions becoming

φ_{m,n}(x) = |a0|^{m/2} φ(a0^m x − n b0)

(5.2)

ψ_{m,n}(x) = |a0|^{m/2} ψ(a0^m x − n b0)

For the choice a0 = 2, b0 = 1 one obtains

φ_{m,n}(x) = 2^{m/2} φ(2^m x − n)

(5.3)

ψ_{m,n}(x) = 2^{m/2} ψ(2^m x − n)

The dilation equation, relating the mother wavelet to the scaling function, is

ψ(x) = √2 Σ_k g(k) φ(2x − k)

(5.4)

where g(k) = (−1)^k h(1 − k).

The wavelet coefficients and the corresponding wavelet decomposition are given as

c_{m,n} = ∫_{−∞}^{+∞} f(x) ψ_{m,n}(x) dx

(5.5)

f(x) = Σ_{m,n} c_{m,n} ψ_{m,n}(x)

Daubechies (1988) has shown how one can derive the corresponding low-pass ('h') and high-pass ('g') filters and how to design appropriate families of scaling and mother wavelet functions using Quadrature Mirror Filters (QMF).
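As a numerical illustration of the h/g relation in (5.4), the well-known Daubechies order-2 low-pass filter can be checked directly; the modulo indexing of the finite filter is an implementation convenience:

```python
import numpy as np

# Daubechies order-2 low-pass filter h (standard closed form) and the
# mirror high-pass filter g(k) = (-1)^k h(1 - k), indices taken modulo
# the filter length.
s3 = np.sqrt(3.0)
h = np.array([1 + s3, 3 + s3, 3 - s3, 1 - s3]) / (4 * np.sqrt(2.0))
g = np.array([(-1) ** k * h[(1 - k) % 4] for k in range(4)])
```

The QMF properties follow: h sums to sqrt(2), g sums to zero, the two filters are orthogonal, and each has unit energy.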

Once the coefficients of the discrete wavelet transform (DWT) are derived, one becomes interested in choosing an optimal subset, with respect to some reconstruction criterion, for data compression purposes. Towards that end, Coifman and Wickerhauser (1992) define the Shannon entropy (μ) as

μ(v) = −Σ_i ||v_i||^2 ln ||v_i||^2

(5.6)

where v = {v_i} is the corresponding set of wavelet coefficients. The Shannon entropy measure is then used as a cost function for finding the best subset of wavelet coefficients. Note that minimum entropy corresponds to less randomness ('dispersion') and thus leads to clustering. If one generates the complete wavelet representations (wavelet packets) as a binary tree, the selection of the best coefficients is done by comparing the entropy of wavelet packets corresponding to adjacent tree levels (father-son relationships). One compares the entropy of each adjacent pair of nodes to the entropy of their union, and the subtree is expanded further only if it results in lower entropy. For a signal of size n, the DWT yields n coefficients, and the search for optimal coefficients yields that set (still of size n) for which the Shannon entropy is minimized. Data compression, subject to the same entropy criterion, ranks the optimal coefficients according to their magnitude and picks subsets of m coefficients, where m is less than n.
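The parent/child entropy comparison can be sketched on a Haar packet tree; the Haar filter and a depth of two are simplifying assumptions (the experiments use Daubechies order two):

```python
import numpy as np

def haar_step(x):
    """One Haar analysis step: approximation and detail halves."""
    return (x[0::2] + x[1::2]) / np.sqrt(2.0), (x[0::2] - x[1::2]) / np.sqrt(2.0)

def shannon_cost(v):
    """Additive Shannon entropy cost  -sum v_i^2 ln v_i^2  (Eq. 5.6)."""
    v2 = v[v != 0.0] ** 2
    return float(-(v2 * np.log(v2)).sum())

def best_basis(x, depth=2):
    """Coifman-Wickerhauser search: a node is split into its two
    children only when their summed cost is lower than its own."""
    if depth == 0:
        return [x]
    a, d = haar_step(x)
    children = best_basis(a, depth - 1) + best_basis(d, depth - 1)
    if sum(shannon_cost(c) for c in children) < shannon_cost(x):
        return children
    return [x]

# A near-constant signal: the approximation branch concentrates the
# energy, so splitting lowers the entropy cost.
signal = np.ones(64) + 0.01 * np.sin(np.arange(64))
leaves = best_basis(signal)
```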

Experimental Data

Windows of size 8x32 were manually cut from different regions of the human face, corresponding to eye and non-eye images. The images are raster scanned to form 1-D vectors consisting of 256 elements, as shown in Fig. 5.1a. The training data consist of 50 eye images and 20 non-eye images, while the testing data consist of 40 eye and 10 non-eye images. Wavelet packets, corresponding to the eye and non-eye images, are derived using the Daubechies family of order two. Optimal decompositions are then found using the Shannon entropy, and the best subsets, consisting of 16 or 64 coefficients, are derived as explained earlier. As an example, Fig. 5.1b shows the reconstructed image using the 16 largest coefficients. The root mean square (rms) error between the reconstructed and the original image is computed as:

 

 

 

 

e_rms = [ (1/256) Σ_{i=1}^{256} ( f_R(x_i) − f_O(x_i) )^2 ]^{1/2}

(5.7)

 

The reconstructed images capture relevant characteristics ('features') and discard structural 'noise'. The reconstructed images (using 16 and 64 largest coefficients) are then passed to the Radial Basis Function classifier to assess the relevance of optimal signal decompositions ('functional approximation') for classification tasks.

[Figure: 1-D signals and reconstructed eye images; 16 best-basis wavelet coefficients give e_rms = 29.36, while all 256 coefficients give e_rms = 0]

Figure 5.1. Reconstructed Eye Images Using Wavelet Best Bases

RBF Classification

The RBF classifier has an architecture very similar to that of a traditional three-layer back-propagation network. Connections between the input and middle layers have unit weights and, as a result, do not have to be trained. Nodes in the middle layer, called BF nodes, produce a localized response to the input using Gaussian kernels. The basis functions (BF) used are Gaussians, where the activation level y_i of hidden unit i is given by

y_i = Φ_i(||X − μ_i||) = exp[ − Σ_{k=1}^{D} (x_k − μ_ik)^2 / (2 h σ_ik^2) ]

(5.8)

where h is a proportionality constant for the variance, x_k is the kth component of the input vector X = [x_1, x_2, ..., x_D], and μ_ik and σ_ik are the kth components of the mean and variance vectors, respectively, of basis function node i. Each hidden unit can be viewed as a localized receptive field (RF). The hidden layer is trained using k-means clustering.

The classification stage involves the RBF scheme. Some details on the wrapper implementation include: (i) the width of the Gaussian function has been set equal to the maximum distance between the farthest pattern belonging to the same class and the closest pattern of another class, (ii) local rather than global proportionality factors are used, and (iii) patterns from the testing set are classified as belonging to the class associated with the output node yielding the largest output. RBF training stops when 100% correct classification is achieved, and the number of clusters and the proportionality factor are frozen at 35 and 1,000, respectively. The experiments carried out involve 50 eye and 20 non-eye images for training, and a different set consisting of 40 eye and 20 non-eye images for testing. The same experiment was performed for the case when the original images were used and for those cases when the images used were reconstructed using the best 16 and 64 wavelet coefficients. The results of eye and non-eye correct classification are shown in Table 5.1.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

            Performance Using      Performance Using Reconstructed     Performance Using Reconstructed
            Original (8x32) Data   Data from 16 Wavelet Coefficients   Data from 64 Wavelet Coefficients
Eye         70%                    82.5%                               85.0%
Non-eye     70%                    80.0%                               100%

Table 5.1. Eye vs. Non-Eye Classification Results
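The RBF training pipeline (k-means for the hidden units, then a linear read-out) can be sketched on toy 2-D data; the shared isotropic width below is a simplification of the per-node variances and the class-distance width rule described in the text:

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, k, iters=20):
    """Plain k-means for placing the basis-function centers."""
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers

def rbf_train(X, y, k=4, h=1.0):
    """K-means hidden layer, Gaussian activations, least squares output."""
    centers = kmeans(X, k)
    sigma2 = max(float(((centers[:, None] - centers[None]) ** 2).sum(-1).max()), 1e-6)
    Phi = np.exp(-((X[:, None] - centers[None]) ** 2).sum(-1) / (2 * h * sigma2))
    W = np.linalg.lstsq(Phi, np.eye(2)[y], rcond=None)[0]
    return centers, sigma2, W

def rbf_predict(X, centers, sigma2, W, h=1.0):
    Phi = np.exp(-((X[:, None] - centers[None]) ** 2).sum(-1) / (2 * h * sigma2))
    return np.argmax(Phi @ W, axis=1)       # winner-take-all output node

# Two well-separated toy classes standing in for eye / non-eye vectors.
X = np.vstack([rng.normal((-2, -2), 0.3, (30, 2)),
               rng.normal((2, 2), 0.3, (30, 2))])
y = np.array([0] * 30 + [1] * 30)
model = rbf_train(X, y)
acc = float((rbf_predict(X, *model) == y).mean())
```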

Another experiment was carried out to assess how robust the eye detection process is on miscentered eye images. Twenty-eight test images were produced by shifting the center of the eye from its original position using two-pixel upward/downward shifts and/or two- or four-pixel left/right shifts. The classification rates obtained for the original vs. reconstructed images using the best 64 wavelet coefficients were 50% and 80%, respectively. From the above experiments, one concludes that the performance of the Radial Basis Function (RBF) classifier improves when the original (raw) images are replaced by optimally reconstructed images using the best wavelet coefficients. As entropy minimization drives the derivation of optimal wavelet packets, clustering ('prototyping') captures the most significant characteristics of the underlying data and is thus compatible with the first operational stage of the RBF classifier.

Feature Selection

Any object or pattern that has to be recognized and/or classified must possess a number of discriminatory properties or features. The first step in any recognition process, performed either by a machine or by a human being, is to choose candidate discriminatory features and evaluate them for their usefulness. Feature selection in pattern recognition involves the derivation of salient features from the raw input data in order to reduce the amount of data used for classification and simultaneously provide enhanced discriminatory power. The number of features needed to successfully perform a given classification task depends on the discriminatory qualities of the selected features.

One usually starts with a given set of features and then attempts to derive an optimal subset of features leading to high classification performance. A standard approach involves ranking the features of a candidate feature set according to some criteria involving 2nd order statistics (ANOVA) and/or information theory based measures such as "infomax", and then deleting lower ranked features. Ranking by itself is usually not enough because the criteria used do not measure the effectiveness of the features selected on the actual classification task itself, nor do they capture possible non-linear interactions among the features.

The selection of an appropriate set of features is one of the most difficult tasks in the design of a pattern classification system. At the lowest level, the raw feature data is not nice clean symbolic data like "green", but rather noisy sensor data (e.g., spectral properties) whose characteristics are complex and irregular. In addition, there is considerable interaction among low-level features which must be identified and exploited. However, the typical number of possible features is so large as to prohibit any systematic exploration of all but a few possible interaction types (e.g., pairwise interactions). Large feature sets with noisy numerical data also provide considerable difficulty for traditional symbolic inductive learning systems. The running time of the learning system and the accuracy and complexity of the output rapidly fall below an acceptable level.

Feature selection in general has to confront the problem of searching over a large space of nonlinear and higher-order feature combinations, which is usually prohibitively expensive. As the search space defining both the features themselves and their possible ('data fusion') integration is exponential in its complexity, one comes to employ alternative methods, such as those underlying natural selection and evolution. Such methods are often driven by statistical considerations using information theory concepts such as the Kullback divergence measure and the maximum information preservation principle (infomax). As an example, Linsker (1988) evolved receptive fields, characteristic of center-surround cells and/or orientation columns, that maximize their output activity variance. Similarly, the feed-forward competitive learning model of Neven and Aertsen (1992) develops various types of RF profiles which reflect the correlation structure of the input space, characterized by its various classes of intra- and inter-feature relations. Atick and Redlich (1993) take advantage of entropy reduction as the underlying idea when the set of RFs is derived based on minimization of sums of pixel entropies, subject to no overall loss of information.

One of the goals of this thesis is to illustrate the feasibility of deriving appropriate subsets of features and then integrating them for detecting facial landmarks. The rationale behind our approach is the belief (Michalski, 1994) that further advances in pattern analysis and classification require the integration of various learning processes in a modular fashion. Learning systems that employ several strategies can potentially offer significant advantages over single-strategy systems. Since the type of input and acquired knowledge are more flexible, such hybrid systems can be applied to a wider range of problems. Examples of such integration include combinations of genetic algorithms and neural networks (Gruau and Whitley, 1993) and genetic algorithms and rule-based systems (Bala et al., 1994; Vafaie and De Jong, 1994).

The integration of genetic algorithms and inductive decision tree learning for optimal feature selection and pattern classification is a novel application of such an approach and is the topic of this chapter. We have selected ID3-like induction algorithms, which use entropy as an information measure during tree derivation. This same entropy also underlies the infomax principle - maximum information preservation between successive processing layers. Self-organization in perceptual networks and the development of receptive fields has been shown to be driven by such a principle. Specifically, Linsker (1988) has reported that a perceptual system develops to recognize relevant features of its environment using the infomax principle.

The integration of genetic algorithms and decision tree learning is also part of a broader issue being actively explored, namely, that evolution and learning can work synergistically (Hinton and Nowlan, 1987). The ability to learn can be shown to ease the burden on evolution. Evolution (genotype learning) only has to get close to the goal; (phenotype) learning can then fine-tune the behavior (Muhlenbein and Kinderman, 1989). Although Darwinian theory does not allow for the inheritance of acquired characteristics (Lamarckian evolution), learning (acquired behaviors) can still influence the course of evolution. The Baldwin effect, where local search is employed to change the fitness of strings but the acquired improvements do not change the genetic encoding of the individual, is under active study (Whitley et al., 1994). One can gain a further perspective on the Lamarckian hypothesis by moving up from the individual chromosome (agent) to ecosystems (species) and addressing cultural evolution as well (Wechsler, 1993).

The Hybrid GA-DT System

The basic idea of our hybrid system is to use GAs to efficiently explore the space of all possible subsets of a given feature set in order to find feature subsets which are of low order and high discriminatory power. In order to achieve this goal, we felt that fitness evaluation had to involve direct measures of size and classification performance, rather than measures such as the ranking methods discussed in the previous section. The speed of DT induction suggested the feasibility of the approach shown in Figure 5.2.

An initial set of features is provided together with a training set of measured feature vectors extracted from raw data corresponding to examples of the concepts for which the decision tree is to be induced. The genetic algorithm (GA) is used to explore the space of all subsets of the given feature set, where preference is given to those feature sets which achieve better classification performance using smaller-dimensionality feature sets. Each of the selected feature subsets is evaluated (its fitness measured) by testing the decision tree produced by C4.5 (Quinlan, 1986). The above process is iterated along evolutionary lines, and the best feature subset found is then recommended for use in the actual design of the pattern classification system.
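The GA-DT loop can be sketched end to end on synthetic data; here a nearest-centroid classifier stands in for the C4.5 evaluation, and all data, penalties, and GA settings are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: 8 candidate features, but only features 0 and 3
# carry class information; the rest are noise.
n = 200
y = rng.integers(0, 2, n)
X = rng.normal(size=(n, 8))
X[:, 0] += 2.0 * y
X[:, 3] -= 2.0 * y

def fitness(mask):
    """Accuracy of a nearest-centroid classifier on the selected
    features, minus a small per-feature penalty (prefers small subsets)."""
    if not mask.any():
        return 0.0
    Xs = X[:, mask]
    c0, c1 = Xs[y == 0].mean(0), Xs[y == 1].mean(0)
    pred = (np.linalg.norm(Xs - c1, axis=1) <
            np.linalg.norm(Xs - c0, axis=1)).astype(int)
    return float((pred == y).mean()) - 0.02 * int(mask.sum())

pop = rng.integers(0, 2, (20, 8)).astype(bool)
for _ in range(30):
    scores = np.array([fitness(ind) for ind in pop])
    parents = pop[np.argsort(scores)[::-1][:10]]          # truncation selection
    children = []
    for _ in range(10):
        a, b = parents[rng.integers(0, 10, 2)]
        cut = rng.integers(1, 8)                          # one-point crossover
        child = np.concatenate([a[:cut], b[cut:]])
        children.append(child ^ (rng.random(8) < 0.05))   # bitwise mutation
    pop = np.vstack([parents, children])

best = pop[np.argmax([fitness(ind) for ind in pop])]
```

The best individual is the recommended feature subset; with a real induction system the fitness call would train and test a decision tree instead of the centroid proxy.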

[Figure: Full Feature Set -> EVOLUTION OF FEATURE SUBSETS -> Optimal Feature Subset; fitness measure: classification accuracy, classification cost, information content]

Figure 5.2 Hybrid Learning System Using Genetic Algorithms and Decision Trees

In order for a GA to efficiently search such large spaces, one must give careful thought to both the representation chosen and the evaluation function. In this case, there is a very natural representation of the space of all possible subsets of a feature set, namely, a fixed-length binary string representation in which the value of the ith
