
Since offspring reside near their parents with high probability, over several generations the (Baldwin) effect would accelerate convergence toward the more fit genomes discovered by local search without "back coding" them in the genome.

As an example, Liu and Wechsler (1998) integrate evolution and learning to capture the non-accidental spatiotemporal properties ('regularities') called Optimal Projection Axes (OPA) for face recognition by searching through all the rotations defined over whitened PCA subspaces. Evolution is driven by a fitness function defined in terms of performance accuracy and class separation ('scatter index'). Accuracy indicates the extent to which learning has been successful so far, while the scatter index gives an indication of the expected fitness on future trials. Experimental results using a large data set (1,107 facial images from the US Army FERET database) and a recognition rate of 92%, compared with other methods such as Eigenfaces (87%) and MDF (86%), demonstrate the feasibility of their approach.
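The exact form of the Liu and Wechsler fitness function is not reproduced above; the following is a minimal sketch, assuming the fitness simply adds the recognition accuracy obtained so far to a weighted between-class/within-class scatter ratio. The function names, the weighting parameter, and the particular scatter definition are illustrative assumptions, not their implementation.

import numpy as np

def scatter_index(features, labels):
    """Between-class to within-class scatter ratio (one common 'scatter index')."""
    labels = np.asarray(labels)
    grand_mean = features.mean(axis=0)
    s_between = s_within = 0.0
    for c in np.unique(labels):
        x = features[labels == c]
        class_mean = x.mean(axis=0)
        s_between += len(x) * np.sum((class_mean - grand_mean) ** 2)
        s_within += np.sum((x - class_mean) ** 2)
    return s_between / (s_within + 1e-12)

def opa_fitness(accuracy_so_far, projected_features, labels, weight=1.0):
    """Hypothetical combined fitness: accuracy achieved so far plus weighted class separation."""
    return accuracy_so_far + weight * scatter_index(projected_features, labels)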

In the next chapters we describe a hybrid methodology that integrates genetic algorithms ('evolution') and decision trees ('learning') in order to derive useful subsets of discriminatory features for recognizing complex visual concepts. A genetic algorithm (GA) is used to search the space of all possible subsets of a large set of candidate discrimination features. Candidate feature subsets are evaluated by using C4.5, a decision-tree learning algorithm, to produce a decision tree based on the given features using a limited amount of training data. The classification performance of the resulting decision tree on unseen testing data is used as the fitness of the underlying feature subset. Experimental results show how increasing the amount of learning significantly improves feature set evolution for difficult visual recognition problems involving facial image data (Bala, De Jong, Huang, Vafaie, and Wechsler, 1996). We observe that the actual number of features used by the classifier is significantly smaller than that specified by the genome. This information is not back-coded into the genome; only the improved fitness is returned.
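A minimal sketch of this wrapper idea follows. It is not the implementation used in (Bala et al., 1996): scikit-learn's DecisionTreeClassifier stands in for C4.5, the GA is a bare-bones bit-string GA, and all function names, population sizes, and rates are illustrative assumptions.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def feature_subset_fitness(mask, X, y, seed=0):
    """Fitness of a feature subset = accuracy of a decision tree (a C4.5 stand-in)
    trained on that subset and scored on data held out from training."""
    cols = np.flatnonzero(mask)
    if cols.size == 0:
        return 0.0
    X_tr, X_te, y_tr, y_te = train_test_split(X[:, cols], y, test_size=0.5, random_state=seed)
    tree = DecisionTreeClassifier(random_state=seed).fit(X_tr, y_tr)
    return tree.score(X_te, y_te)

def evolve_feature_subsets(X, y, pop_size=20, generations=30, p_mut=0.02, seed=0):
    """Bare-bones generational GA over bit-string feature masks (illustrative parameters)."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    pop = rng.integers(0, 2, size=(pop_size, n))
    for _ in range(generations):
        fit = np.array([feature_subset_fitness(m, X, y) for m in pop]) + 1e-9
        parents = pop[rng.choice(pop_size, size=pop_size, p=fit / fit.sum())]
        children = parents.copy()
        for i in range(0, pop_size - 1, 2):          # one-point crossover on pairs
            cut = rng.integers(1, n)
            children[i, cut:], children[i + 1, cut:] = parents[i + 1, cut:], parents[i, cut:]
        flip = rng.random(children.shape) < p_mut    # bit-flip mutation
        children[flip] ^= 1
        pop = children
    fit = np.array([feature_subset_fitness(m, X, y) for m in pop])
    return pop[fit.argmax()]                          # best mask found in the final population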


CHAPTER 4

FACE DETECTION

People can easily recognize familiar human faces. Automated face recognition, however, is a difficult problem since face images can vary considerably in terms of facial expressions, age, image quality and photometry, geometry, occlusion, and disguise (Samal and Iyengar, 1992; Chellappa et al., 1995). Face recognition starts with the detection of face patterns, proceeds by normalizing the face images to account for geometrical and illumination changes, possibly using information about the facial landmarks, identifies the faces using appropriate classification algorithms, and post-processes the results using model-based schemes and logistic feedback. The variety of methods published in the literature shows that there is no unique or generic solution to the face recognition problem. Consequently, a taxonomy of face recognition technology cannot identify clear-cut algorithms, but rather processing strategies describing (i) face detection and normalization, (ii) feature extraction, (iii) coding and internal representation, and (iv) classification and recognition.

Face recognition usually starts with (the location and) detection of face patterns. The reason for face detection is to focus computational resources on the face area (and thus speed up further processing), and to provide the spatial context for the normalization stage.

The objective of any face detection system is to segment the face (foreground) from its background and provide its 'box' boundary. Most face recognition systems assume that the face box is already available. Amongst those trying to detect the face box, Yang and Huang (1993) have developed a face detection system for complex backgrounds using a hierarchical knowledge-based method. The accuracy reported is 83% on a database of 60 subjects. Burel and Carel (1994) use a multi-resolution detection and localization phase consisting of scanning each image at seven scales. For each scale, the window contents are normalized to a standard size, and then propagated through a Multi Layer Perceptron (MLP). The accuracy reported is 85% on a database of 40 subjects. Sung and Poggio (1994) introduce a view-based approach for face detection. In their view-based approach, faces are treated as a class of local target patterns to be detected in an image. They define a stable "canonical" face model and generate a multidimensional Gaussian cluster, with the local data distribution around its centroid serving as a pattern prototype for the purpose of face matching. 4,150 normalized canonical face patterns are used to synthesize 6 "face" pattern prototypes in the multi-dimensional image vector space, and 149 face patterns contained in 23 images are tested. The system achieves a 79.9% detection rate. As another example, Rowley et al. (1995) use a cluster of back-propagation networks for face detection. A (retinal) connected neural network examines small windows of an image and decides whether each window contains a face. The system arbitrates between multiple networks to improve performance over a single network. The network is trained using bootstrapping and the accuracy reported is 92.9% on a database of 65 images with complex backgrounds. From the publications mentioned above it is clear that the size of the image database is quite restricted and that all the image data is used for training. As a consequence, no conclusions can be drawn about the ability of such methods to generalize (on unseen test data) and to scale up on large image databases, possibly consisting of several thousand images.

Face Detection Approach

Concerning the face detection task, we develop an algorithm that first decides whether a face is present and, if it is, crops ('boxes') the face. The approach used for face detection involves three main stages: location, cropping, and post-processing. The first stage finds a rough approximation for the possible location of the face box, the second stage refines it, and the last stage decides whether a face is actually present in the image. The first stage locates BOX1, possibly surrounding the face, using simple but fast algorithms in order to focus processing for the cropping stage. The location stage consists of three steps: (i) histogram equalization, (ii) edge detection, and (iii) analysis of projection profiles. The cropping stage labels each 8x8 window from BOX1 as face or non-face using decision trees (DT). The DT are induced ('learned') from both positive ('face') and negative ('non-face') examples expressed in terms of features such as entropy, mean, and standard deviation (sd). The labeled output of the cropping stage is post-processed to (i) decide if a face is present, and if it is present to (ii) derive its BOX2 in terms of straight boundaries, and to (iii) normalize the box to yield uniform-sized boxes BOX3 for all face images. The architecture for implementing the above procedure for detecting the face box is shown below in Fig. 4.1.

[Figure 4.1: block diagram of the face detection architecture. Input images pass through Face Location (histogram equalization, edge detection, profile analysis) to yield BOX1, through Face Detection (feature extraction, decision trees) to yield BOX2, and through Post-Processing (profile analysis, normalization) to yield BOX3.]

Figure 4.1. Face Detection

The FERET Database

For the most part, the performance of face recognition systems reported in the literature has been measured on small databases, with each research site carrying out experiments using its own database and thus making meaningful comparisons and drawing conclusions impossible (Robertson and Craw, 1994). The majority of those databases were collected under very controlled situations and consisted of a relatively small number of subjects and corresponding images. To overcome such shortcomings, we have been developing over the last several years the FERET facial database so that a standard testbed for face recognition applications can become available (DePersia and Phillips, 1995; Gutta, 1995). The FERET database now consists of 1,564 sets comprising 14,126 images. It contains 1,199 individuals and 365 duplicate sets, taken at different times (and possibly wearing glasses), each set consisting of several poses. Since large amounts of images were acquired during different photo sessions, the lighting conditions and the size of the facial images can vary. The diversity of the FERET database is across gender, race, and age, and it includes duplicates as well. Fig. 4.2 is indicative of the range of facial images the FERET database now contains.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

[Figure 4.2: a grid of sample FERET facial images, one row per pose ('fa', 'fb', 'hl', 'hr', 'pr'), each image labeled by its FERET file name, e.g. 00002fa010.93083, 00026fa000.93123, 00089fa010.93123, 00029fa010.94012, 00259fa010.94012, 00303fa010.94042, 00311fa010.94042, 00702fa010.94120, 00392fa010.94042, 00203fa010.94018.]

Figure 4.2. Facial Images from FERET

Acquisition of duplicate sets is very important if one wants to assess how robust a given face recognition system is when tested on images shot at different times, which are likely to differ. Most of the sets consist of the following poses: two frontal shots ('fa' and 'fb'), quarter (right and left) profiles ('qr' and 'ql'), half (right and left) profiles ('hr' and 'hl'), and right and left (90 deg.) profiles ('pr' and 'pl'). In addition we recently collected several hundred sets that have several additional poses at the midpoints between: 'hr' and 'qr' ('ra'), 'qr' and frontal view ('rb'), frontal view and 'ql' ('rc'), 'ql' and 'hl' ('rd'), and 'hl' and 'pl' ('re'). The additional poses were taken to assess the capability of modeling the human face using several positions and interpolating or extrapolating amongst them for future identification tasks. The facial image sets were acquired without any restrictions imposed on expression and with two or three frontal images shot at different times during the photo session. They are used as the testbed in the different experiments.

Face Detection Using Decision Trees

We propose a novel algorithm for face detection using decision trees (DT) and show its generality and feasibility using a database consisting of 2,340 face images from the FERET database (corresponding to 817 subjects and including 190 sets of duplicates) over a semi-uniform background. The algorithm decides first if a face is present, and if the face is present it crops ('boxes') the face. Experiments were also performed to assess the accuracy of the algorithm in rejecting images where no face is present, using a small database of 25 images with varied but complex backgrounds.

Face Location

The location stage indicates where the face contents are most likely to be found and thus focuses attention for the next stage, that of (face) cropping. The whole procedure is straightforward and yields a rough approximation for the sought-after face box.

Histogram equalization makes the image's cumulative histogram approximately linear and accounts for illumination differences experienced during the image acquisition process. Edge detection finds transitions ('edges') in the intensity image using the Marr-Hildreth operator (Marr and Hildreth, 1980). The location stage then proceeds with horizontal and vertical projections along the x- and y-axes, respectively. The resulting profiles are searched for their maximum and minimum values using model-based ('face') constraints. The horizontal projection profile is analyzed by first detecting all its local extrema (min and max) points, and then measuring the distance d between each min and the following max peak. The top side of the boundary box is placed at the max peak location yielding the largest distance d. A similar procedure is used to analyze the vertical projection and yields the left and right sides of the boundary box. BOX1 is finally completed on its bottom side such that the ratio between the vertical and horizontal dimensions of the box is 1.5. Note that this procedure can process faces of different sizes.
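A minimal sketch of this location stage follows, assuming a Laplacian-of-Gaussian filter as a stand-in for the Marr-Hildreth operator and interpreting the distance d as the rise in profile value from a minimum to the following maximum; the function names, the smoothing scale, and the left/right selection rule are illustrative assumptions rather than the exact procedure.

import numpy as np
from scipy import ndimage

def local_extrema(profile):
    """Indices of local minima and maxima of a 1-D projection profile."""
    d = np.sign(np.diff(profile))
    minima = np.flatnonzero((d[:-1] < 0) & (d[1:] > 0)) + 1
    maxima = np.flatnonzero((d[:-1] > 0) & (d[1:] < 0)) + 1
    return minima, maxima

def strongest_rise(profile):
    """Max-peak position whose rise from the preceding minimum is largest
    (one reading of 'the max peak location yielding the largest distance d')."""
    minima, maxima = local_extrema(profile)
    best_pos, best_d = 0, -np.inf
    for m in minima:
        later = maxima[maxima > m]
        if later.size and profile[later[0]] - profile[m] > best_d:
            best_pos, best_d = later[0], profile[later[0]] - profile[m]
    return best_pos

def locate_box1(gray, aspect=1.5):
    """Rough face box BOX1: histogram equalization, edge detection, projection profiles."""
    hist, _ = np.histogram(gray, bins=256, range=(0, 256))
    cdf = hist.cumsum() / hist.sum()                       # equalization: linearize the CDF
    eq = (255 * cdf[gray.astype(np.uint8)]).astype(np.uint8)
    edges = np.abs(ndimage.gaussian_laplace(eq.astype(float), sigma=2.0))
    h_profile = edges.sum(axis=1)                          # one value per image row
    v_profile = edges.sum(axis=0)                          # one value per image column
    top = strongest_rise(h_profile)
    left = strongest_rise(v_profile)
    right = len(v_profile) - 1 - strongest_rise(v_profile[::-1])
    bottom = min(gray.shape[0] - 1, int(top + aspect * (right - left)))
    return top, bottom, left, right                        # BOX1 bounds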

Face Cropping

The cropping stage refines the output coming from the location stage using DT. The DT are learned ('induced') from both positive ('face') and negative ('non-face') examples expressed in terms of extracted features such as entropy, mean, and standard deviation (sd). The detection stage is run across 8x8 windows from BOX1.

Each 8x8 window and its corresponding four quadrants are processed to yield the features required for learning from examples and the derivation of the DT. Entropy, mean, and standard deviation features are derived using original image data or Laplacian-processed image data from the face box to yield thirty (30 = 5x3x2) feature values.
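The following is a minimal sketch of this feature extraction, assuming the factorization stated above (5 regions x 3 statistics x 2 image versions); the particular Laplacian filter, the histogram-based entropy estimate, and the quadrant ordering are illustrative assumptions.

import numpy as np
from scipy import ndimage

def region_stats(region):
    """Entropy, mean, and standard deviation of one image region."""
    hist, _ = np.histogram(region, bins=256)
    p = hist[hist > 0] / region.size
    return -np.sum(p * np.log2(p)), region.mean(), region.std()

def window_features(win):
    """30 feature values per 8x8 window:
    (full window + its 4 quadrants) x (entropy, mean, sd) x (original, Laplacian) = 5 x 3 x 2."""
    feats = []
    for img in (win.astype(float), ndimage.laplace(win.astype(float))):
        for region in (img, img[:4, :4], img[:4, 4:], img[4:, :4], img[4:, 4:]):
            feats.extend(region_stats(region))
    return np.array(feats)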

Once feature values become available, learning from examples induces the DT. As inductive learning requires symbolic data, each of the positive examples corresponding to the face region is tagged 'CORRECT', while the negative examples corresponding to non-face regions are tagged 'INCORRECT'. Decision trees are induced using C4.5 (Quinlan, 1986). The input to C4.5 consists of a set of learning events, each event given as a vector of 30 attribute values. C4.5 takes a set of positive examples and a set of negative examples and builds a classifier as a decision tree whose structure consists of (i) leaves, indicating class identity, or (ii) decision nodes that specify some test to be carried out on a single attribute value, with one branch for each possible outcome of the test.

A decision tree is now used to classify each 8x8 non-overlapping window from BOX1 by starting at the root of the tree and moving through it until a leaf is encountered. At each non-leaf node a decision is made using an optimal (entropy) criterion such as the gain ratio criterion. As a result of this procedure each 8x8 window from BOX1 is now labeled as face or non-face. The data set used to train the DT comes from 12 images whose resolution is 256x384 and consists of 2,759 positive ('CORRECT', '+') window examples and 15,673 negative ('INCORRECT', '-') window examples. Note that inducing the DT corresponds to training and requires suitable strategies. The (corrective) training strategy used here, that of tuning the DT on learning events the DT fails to resolve, corresponds to active learning (Krogh and Vedelsby, 1995).
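C4.5 itself is not reproduced here; as a rough sketch of the induction and window-labeling steps, the snippet below uses scikit-learn's DecisionTreeClassifier with entropy-based splits as a stand-in for C4.5's gain-ratio criterion. The function names and the assumption that features arrive as a precomputed (n_windows, 30) array are illustrative.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def induce_tree(feature_vectors, labels):
    """Induce a decision tree from labeled window examples.

    feature_vectors : (n_windows, 30) array of per-window feature values
    labels          : 'CORRECT' for face windows, 'INCORRECT' for non-face windows
    """
    # entropy-based splits, a stand-in for C4.5's gain-ratio criterion
    return DecisionTreeClassifier(criterion="entropy").fit(feature_vectors, labels)

def label_box1_windows(tree, box1_features):
    """Label every non-overlapping 8x8 window of BOX1, given its feature vectors."""
    return tree.predict(np.asarray(box1_features))   # array of 'CORRECT' / 'INCORRECT'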

Post Processing

The output from the cropping stage consists of labeled 8x8 (face vs. non-face) windows. Horizontal and vertical profiles count the number of face labels and are thresholded to yield the extent of the face image, if present at all, and provide a new and refined approximation for the face as BOX2. Small counts in both the horizontal and vertical profiles indicate the absence of a face. Normalization takes BOX2, if found at all, and produces a standard-size BOX3, available for further processing.
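A minimal sketch of this post-processing step is given below; the count threshold and the fixed BOX3 size are illustrative assumptions (neither value is specified above), and the resampling uses a generic zoom rather than any particular normalization scheme.

import numpy as np
from scipy import ndimage

def refine_box2(face_labels, min_count=3):
    """face_labels: 2-D boolean grid, True where an 8x8 window was labeled as face.

    Returns the BOX2 bounds (in window-grid coordinates) or None when the small
    profile counts indicate that no face is present."""
    h_counts = face_labels.sum(axis=1)              # face windows per grid row
    v_counts = face_labels.sum(axis=0)              # face windows per grid column
    rows = np.flatnonzero(h_counts >= min_count)
    cols = np.flatnonzero(v_counts >= min_count)
    if rows.size == 0 or cols.size == 0:
        return None                                 # no face detected
    return rows[0], rows[-1], cols[0], cols[-1]     # top, bottom, left, right

def normalize_box3(box2_image, box3_shape=(96, 64)):
    """Resample the BOX2 image region to the fixed BOX3 size."""
    zy = box3_shape[0] / box2_image.shape[0]
    zx = box3_shape[1] / box2_image.shape[1]
    return ndimage.zoom(box2_image, (zy, zx))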

Experimental Results

The database for our experiments consists of 2,340 frontal face images coming from 817 subjects and 190 duplicate subjects. Each image is of size 256x384 using 256 gray-scale levels. In order to assess the ability of our algorithm to reject images in which no face is present, we assembled a small video database consisting of 25 images with varied but complex backgrounds, whose size is 320x240, again using 256 gray-scale levels. Examples of face and non-face images are shown in Fig. 4.3a and 4.3b, respectively.

The whole process for one face image is shown in Fig. 4.4. It displays first the original face image, then the intermediate outputs: BOX1, the labeled contents of BOX1 with the corresponding horizontal and vertical profiles, the refined BOX2, and finally the normalized BOX3.

(a) Face images

(b) Non-Face images

Figure 4.3 Face and Non-Face Images


[Figure 4.4: the detection process for one face image, shown stage by stage following the architecture of Fig. 4.1: input image, histogram equalization, edge detection, profile analysis, BOX1, feature extraction, decision trees, BOX2, profile analysis, normalization, BOX3.]

Figure 4.4. Face Detection Process

The performance obtained, stage-wise, is as follows. Based on visual observation, the accuracy (head within the box) of the location (1st) stage is 99%; 22 incorrect boxes (out of 2,340 images) were produced. Note that incorrect as well as correct boxes are passed as BOX1 to the cropping stage. Accuracy for the whole process, i.e. for cropping the face, is based on the visual observation that the box includes both eyes, nose, and mouth, and that the top side of the box is below the hairline; the overall accuracy at the end of the process is 96%. For the experiment using 25 non-face images, the algorithm failed on two images, for an overall rejection accuracy of 92%. Fig. 4.5 shows the final result, BOX3, if found at all, for the image examples shown in Fig. 4.3.

(a)

(b)

Figure 4.5. Examples of (a) Face and (b) Non-Face Detection


One comment regarding computation is of interest. The first stage, that of (face) location, serves the role of an attention mechanism and thus focuses processing for later stages. The timings for the location and cropping stages are 7.9 seconds and 152 seconds per image, respectively, using a SUN SPARC-20 with a 50 MHz processor. When the location stage is not used and the cropping stage takes place over the whole image, the time required is 288 seconds per image. On average, the use of the location ('attention') stage thus reduces computation by 45%, while the corresponding search space is reduced by 53%.

Face Location Using Color vs. Gray Scale Images

To explore the extent to which color information is helpful for face location, we have performed two experiments locating the face using (i) color and (ii) gray-scale images, respectively. Both the color data set and its corresponding gray-scale data set consist of 200 images at a resolution of 64x96. Among the 200 images, 20 are used to generate training examples while the remaining 180 are used to generate test cases. The approach used for face location involves two stages: (i) window classification and (ii) post-processing. The first stage labels 4x4 windows as face or non-face cases using decision trees (DT). The DT for gray-scale images are induced ('learned') from both positive ('face') and negative ('non-face') examples expressed in terms of 12 features such as entropy and mean, with/without Laplacian preprocessing. The same types of features are used for both gray-scale and color images. As color images consist of R, G, and B channels, 36 rather than 12 features are computed for each color window. The labeled output from the first stage is post-processed using horizontal and vertical projections to locate the boundaries defining the face boxes. The projections count the number of 4x4 windows labeled as face cases.

For the color data set, we normalize the color values from the RGB space to the rgb space, where

r = R / (R + G + B),   g = G / (R + G + B),   b = B / (R + G + B)     (4.1)
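As a small illustration of Eq. (4.1), the following sketch normalizes an RGB image to r, g, b chromaticity values; the guard against division by zero on black pixels is an added assumption.

import numpy as np

def rgb_normalize(image):
    """Map an (H, W, 3) RGB image to the normalized r, g, b values of Eq. (4.1)."""
    rgb = image.astype(float)
    s = rgb.sum(axis=2, keepdims=True)
    s[s == 0] = 1.0                  # guard: avoid dividing by zero on black pixels
    return rgb / s                   # r + g + b == 1 at every (non-black) pixel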

Each 4x4 window and its corresponding four 2x2 quadrants are processed to yield the features required for learning from examples and the derivation of the DT. Entropy and mean features are derived using original image data or Laplacian-preprocessed image data from each 4x4 window to yield 36 and 12 feature values for color and gray-scale images, respectively. Once the feature values become available, learning from examples induces the DT. A decision tree is then used to classify each 4x4 non-overlapping window from the test images.

Experiments were run on a total of 200 face images corresponding to 200 subjects with a resolution of 64x96 encoded using 256 gray scale levels for gray-scale images, and using the r, g, b normalized color components for color images. A sample set of face images is shown below in Fig. 4.6. The 200 images were divided into sets of 20 and 180 images for training and testing, respectively. From each of the 20 training images, a total of 384 windows of dimension 4x4 (some boxes corresponding to the face and others corresponding to the non-face regions) were manually labeled. 7680 windows (+/- examples) consisting of 36 (from color images) or 12 (from gray scale images) feature values for each window are used to train the DT.



Figure 4.6. Gray Scale vs. r, g, b channels using Color Images

The accuracy rate on face location is based on visual inspection to determine whether the box obtained includes both eyes, nose, and mouth, and whether the top side of the box is below the hairline. The overall accuracy rate is 85.5% for gray-scale images and 90.6% for color images. Examples of face boxes are shown in Fig. 4.7.

Figure 4.7. Examples of Face Detection results

Face Detection for Video-Based Person Authentication

As more and more forensic information becomes available on video, this thesis also addresses face detection for Automatic Video-Based Biometric Person Authentication (AVBPA). For an AVBPA system, possible tasks and application scenarios under consideration involve detection and tracking of humans and human (ID) verification. Authentication corresponds to ID verification and involves actual (face) recognition for the subject(s) detected in the video sequence. As described below, the architecture for AVBPA takes advantage of the active vision paradigm; it involves frame-difference methods or optical flow analysis to detect the moving subject, projection analysis and decision trees (DT) for face location, and a connectionist network, the Radial Basis Function (RBF) network, for authentication.

Active Vision (AV) has advanced the widely held belief that intelligent data collection rather than image recovery and reconstruction is the goal of perception. Active vision, also referred to as Active, Purposive and Selective Perception (APSP), involves a large degree of adaptation, and provides an intelligent (video) observer (system) with the capability to decide where to seek information, what information to pick up, and how to process it.


APSP is a process of intelligent control applied to both data acquisition and processing, and its specific ('reactive') activation depends on the current state of data interpretation. As an example, our architecture first seeks to decide 'where' to gather information for subsequent ID verification, while if the confidence associated with any processing stage is low, further sensing and gathering of information is performed. Further sensing and gathering of information would be required in the following two possible scenarios: (i) if the face location module fails to detect a face for a particular frame, and (ii) if the face location module detects a face for which the authentication stage fails to reach a high-confidence decision. The active vision paradigm is most appropriate for video processing, where one has to cope with huge amounts of image data and where further sensing and processing of additional frames is feasible. As a result of such an approach, video processing becomes feasible in terms of decreased computational resources ('time') spent and increased confidence in the (authentication) decisions reached despite sometimes poor-quality imagery.

The scenarios considered under Automatic Video-based Biometric Person Authentication (AVBPA) involve video sequences consisting of moving subjects approaching the video camera. The overall architecture for such AVBPA scenarios is shown in Fig. 4.8. Preprocessing, not shown in Fig. 4.8, first reduces the amount of future processing by skimming the video sequence at reduced sample rates (six rather than thirty frames per second). The contents of frames from the skimmed video sequences are then assessed as of low vs. high signal-to-noise ratio. Our architecture then consists of three main stages: (frame) detection of the moving subject ('video break'), location of the subjects' faces ('key frames'), and authentication ('MATCH - recognition') of persons appearing in the video sequence. The first stage, that of frame detection, can be achieved using either of two methods: (i) frame differencing or (ii) optical flow. Once the moving subject is identified, the second stage is activated. The second stage, that of face location, locates the subject's face using inductive decision trees. The face location stage goes on iteratively until a (key) video frame is found where the face is properly located. Following the location of the face, projection analysis yields a box surrounding the face to which the next stage, that of authentication, will be restricted. Should boxing the face fail, the whole process is restarted from the next frame. The boxed face is recognized (MATCHed) using an RBF network. If the confidence associated with authentication is 'low', the whole process described above is repeated starting from the last key frame detected. The whole AVBPA process and the tools needed to implement it are described in detail below.
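The control flow just described can be sketched as a simple loop; the snippet below is only an outline under the stated architecture, with detect_motion, locate_face, and rbf_verify passed in as placeholders for the video-break, face-location, and RBF authentication modules, and with the skim step and confidence threshold as illustrative parameters.

def authenticate_sequence(frames, detect_motion, locate_face, rbf_verify,
                          skim_step=5, confidence_threshold=0.9):
    """Sketch of the AVBPA control loop.

    detect_motion(prev, frame) -> bool        : video break (frame difference / optical flow)
    locate_face(frame)         -> box or None : projection analysis + decision trees
    rbf_verify(frame, box)     -> (id, conf)  : RBF network authentication
    """
    previous = None
    for frame in frames[::skim_step]:                   # skim the sequence at a reduced rate
        if previous is not None and detect_motion(previous, frame):
            box = locate_face(frame)                    # try to find a key frame
            if box is not None:
                identity, confidence = rbf_verify(frame, box)
                if confidence >= confidence_threshold:  # high confidence: authentication done
                    return identity
                # low confidence: keep sensing and continue from subsequent frames
        previous = frame
    return None                                         # no subject authenticated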

 

 

 

[Figure 4.8: block diagram of the AVBPA architecture. The skimmed video sequence is assessed for signal-to-noise ratio (low-SNR frames return to the frame advancer); frame differencing or optical flow detects a video break; the face location module performs projection analysis (BOX1), face cropping with decision trees (BOX2), and face normalization with projection analysis (BOX3); the RBF network performs authentication, returning to the frame advancer on failure or low confidence and terminating on a high-confidence match.]

Figure 4.8. AVBPA Architecture
