Diss / 8
.pdfIEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 17, NO. 6, JUNE 2008 |
897 |
Maximum-Entropy Expectation-Maximization
Algorithm for Image Reconstruction
and Sensor Field Estimation
Hunsop Hong, Student Member, IEEE, and Dan Schonfeld, Senior Member, IEEE
Abstract—In this paper, we propose a maximum-entropy expec- tation-maximization (MEEM) algorithm. We use the proposed algorithm for density estimation. The maximum-entropy constraint is imposed for smoothness of the estimated density function. The derivation of the MEEM algorithm requires determination of the covariance matrix in the framework of the maximum-entropy likelihood function, which is difficult to solve analytically. We, therefore, derive the MEEM algorithm by optimizing a lower-bound of the maximum-entropy likelihood function. We note that the classical expectation-maximization (EM) algorithm has been employed previously for 2-D density estimation. We propose to extend the use of the classical EM algorithm for image recovery from randomly sampled data and sensor field estimation from randomly scattered sensor networks. We further propose to use our approach in density estimation, image recovery and sensor field estimation. Computer simulation experiments are used to demonstrate the superior performance of the proposed MEEM algorithm in comparison to existing methods.
Index Terms—Expectation-maximization (EM), Gaussian mixture model (GMM), image reconstrution, Kernel density estimation, maximum entropy, Parzen density, sensor field estimation.
I. INTRODUCTION
STIMATING an unknown probability density function E(pdf) given a finite set of observations is an important aspect of many image processing problems. The Parzen windows method [1] is one of the most popular methods which provides a nonparametric approximation of the pdf based on the underlying observations. It can be shown to converge to an arbitrary density function as the number of samples increases. The sample requirement, however, is extremely high and grows dramatically as the complexity of the underlying density function increases. Reducing the computational cost of the Parzen windows density estimation method is an active area of research. Girolami and He [2] present an excellent review of recent developments in the literature. There are three broad categories of methods adopted to reduce the computational cost of the Parzen windows density estimation for large sample sizes: a) approximate kernel decomposition method [3], b) data
Manuscript received March 29, 2007; revised January 13, 2008. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Gaurav Sharma.
The authors are with the Multimedia Communications Laboratory, Department of Electrical and Computer Engineering, University of Illinois at Chicago, Chicago, IL 60607-7053 USA (e-mail: hhong6@uic.edu; dans@uic.edu).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TIP.2008.921996
reduction methods [4], and c) sparse functional approximation method.
Sparse functional approximation methods like support vector machines (SVM) [5], obtain a sparse representation in approximation coefficients and, therefore, reduce computational costs for performance on a test set. Excellent results are obtained using these methods. However, these methods scale as making them expensive computationally. The reduced set density estimator (RSDE) developed by Girolami and He [2] provides a superior sparse functional approximation method which is designed to minimize an integrated squared-error (ISE) cost function. The RSDE formulates a quadratic programming problem and solves it for a reduced set of nonzero coefficients to arrive at an estimate of the pdf. Despite the computational efficiency of the RDSE in density estimation, it can be shown that this method suffers from some important limitations [6]. In particular, not only does the linear term in the ISE measure result in a sparse representation, but its optimization leads to assigning all the weights to zero with the exception of the sample point closest to the mode as observed in [2] and [6]. As a result, the ISE-based approach to density estimation degenerates to a trivial solution characterized by an impulse coefficient distribution resulting in a single kernel density function as the number of data samples increases.
However, the expectation-maximization algorithm (EM) [7] provides a very effective and popular alternative for estimating model parameters. It provides an iterative solution, which converges to a local maximum of the likelihood function. Although the solution to the EM algorithm provides the maximum likelihood estimate of the kernel model for density function, the resulting estimate is not guaranteed to be smooth and may still preserve some of the sharpness of the ISE-based density estimation methods. A common method used in regularization theory to ensure smooth estimates is to impose the maximum entropy constraint. There have been some attempts to bind the entropy criterion with EM algorithm. Byrne [8] proposed an iterative image reconstruction algorithm based on cross-entropy minimization using the Kullback–Leibler (KL) divergence measure [9]. Benavent et al. [10] presented an entropy-based EM algorithm for the Gaussian mixture model in order to determine the optimal number of centers. However, despite the efforts to use maximum entropy to obtain smoother density estimates, thus far, there have been no successful attempts to expand the EM algorithm by incorporating a maximum-entropy penalty-based approach to estimating the optimal weight, mean and covariance matrix.
In this paper, we introduce several novel methods for smooth kernel density estimation by relying on a maximum-entropy
1057-7149/$25.00 © 2008 IEEE
898 |
IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 17, NO. 6, JUNE 2008 |
penalty and use the proposed methods for the solution of important applications in image reconstruction and sensor field estimation. The remainder of the paper is organizes as follows. In Section II, we first introduce kernel density estimation and present the integrated squared-error (ISE) cost function. We subsequently introduce the maximum-entropy ISE-based density estimation to ensure that the estimated density function is smooth and does not suffer from the degeneracy of the ISE-based kernel density estimation. Determination of the maximum-entropy ISE-based cost function is a difficult task and generally requires the use of iterative optimization techniques. We propose the hierarchical maximum entropy kernel density estimation (HMEKDE) method by using a hierarchical tree structure for the decomposition of the density estimation problem under the maximum-entropy constraint at multiple resolutions. We derive a closed-form solution to the hierarchical maximum-entropy kernel density estimate for implementation on binary trees. We also propose an iterative solution to a penalty-based maximum-entropy density estimation by using Newton’s method. The methods discussed in this section provide the optimal weights for kernel density estimates which rely on fixed kernels located at few samples. In Section III, we propose the maximum-entropy expectation maximization (MEEM) algorithm to provide the optimal estimates of the weight, mean, and covariance for kernel density estimation. We investigate the performance of the proposed MEEM algorithm for 2-D density estimation and provide computer simulation experiments comparing the various methods presented for the solution of maximum-entropy kernel density estimation in Section IV. We propose the application of both the EM and MEEM algorithms for image reconstruction from randomly sampled images and sensor field estimation from randomly scattered sensors in Section V. The basic EM algorithm estimates a complete data set from partial data sets, and, therefore, we propose to use the EM and MEEM algorithms in these image reconstruction and sensor network applications. We present computer simulations of the performance of the various methods for kernel density estimation for these applications and discuss the advantages and disadvantages in various applications. A discussion of the performance of the MEEM algorithm as the number of kernels varies is provided in Section VI. Finally, in Section VII, we provide a brief summary and discussion of our results.
II.KERNEL-BASED DENSITY ESTIMATION
A. Parzen Density Estimation
The parzen density estimator using the Gaussian Kernel is given by Torkkola [11]
(1)
where is the total number of observation and is the isotropic Gaussian kernel defined by
The main limitation of the Parzen windows density estimator is the very high computational cost due to the very large number of kernels required for its representation.
B. Kernel Density Estimation
We seek an approximation to the true density of the form
(3)
where and the function denotes the Gaussian kernel defined in (2). The weights must be determined such that the overall model remains a pdf, i.e.,
(4)
Later in this paper, we will explore the simultaneous optimization of the mean, variance, and weights of the Gaussian kernels. Here, we focus exclusively on the weights . The variances and means of the Gaussian kernels are estimated by using the -means algorithm in order to reduce the computational burden. Specifically, the centers of the kernels in (3) are determined by -means clustering, and the variance of the kernels is set to the mean of Euclidean distance between centers [12]. We assume that is significantly greater than since the Parzen method relies on delta functions at the sample data which are represented by Gaussian functions with very narrow variance. The mixture of Gaussian model, on the other hand, relies on a few Gaussian kernels and the variance of each Gaussian function is designed to capture many sample points.
Therefore, only the coefficients are unknown. We rely on minimization of the error between and using the ISE method. The ISE cost function is given by
(5)
Substituting and , using (1) and (3), the equation can be expanded and the order of integration and summation exchanged. Thus, we can write the cost function of (5) in vectormatrix form
(6)
where
(7)
Our goal is to minimize this function with respect to under
(2)
the conditions provided by (4). Equation (6) is a quadratic programming problem, which has a unique solution if the matrix
HONG AND SCHONFELD: MAXIMUM-ENTROPY EXPECTATION-MAXIMIZATION ALGORITHM |
899 |
is positive semi-definite [13]. Therefore, can be simplified to
.
In Appendix A, we prove that the solution of the ISE-based kernel density estimation degenerates as the number of observations increases to a trivial solution that concentrates the estimated probability mass in a single kernel. This degeneracy leads to a sharp peak in the estimated density, which is characterized by the minimum-entropy solution.
C. Maximum-Entropy Kernel Density Estimation
Given observations from an unknown probability distribution, there may exist an infinity of probability distributions consistent with the observations and any given constraints [14]. The maximum entropy principle states that under such circumstances we are required to be maximally uncertain about what we do not know, which corresponds to selecting the density with the highest entropy among all candidate solutions to the problem. In order to avoid degenerate solutions to (6), we maximize the entropy and minimize the divergence between the estimated distribution and the Parzen windows density estimate. Here, we use Renyi’s quadratic entropy measure given by [11], which is defined as
Newton’s method for multiple variables is given in [15]
(12)
where denotes the iteration. We will use the soft-max function for the weight constraint [16]. The weight of the center can be expressed as
(13)
Therefore, the derivative of the weight with respect to is given by
(14)
.
For convenience, we define the following variables:
(15)
(16)
(8) |
(17) |
||
Substituting (3) into (8), we obtain |
|
|
(18) |
|
|||
By expanding the square, interchanging the order of summation |
|
|
(19) |
|
|||
We can now express (11) using (15) and (18) |
|||
and integration, we obtain the following: |
(20) |
||
|
|||
(9) |
The element of the gradient of (20) is given by |
Since the logarithm is a monotonic function, maximizing the logarithm of a function is equivalent to maximizing the function. Thus, the maximum entropy solution of the entropy can be reached by maximizing the function expressed in vector-matrix form
The derivation of the gradient is provided in Appendix B. From (57), (58), and (62), the element of the Hessian matrix is given by the following.
a)
The optimal maximum entropy solution |
of |
is |
(10)
where is subject to the constraints provided by (4).
1) Penalty-Based Approach Using Newton’s Method: We adopt the penalty-based approach by introducing an arbitrary constant to balance between the ISE and entropy cost functions. We, therefore, define a new cost function given by
where is the penalty coefficient. Since the variable is constant with respect to it will be omitted. We now have
(11)
(21)
b)
(22)
The detailed derivation of the Hessian matrix are also presented in Appendix B. We assume that the Hessian matrix is positive definite. Finally, the gradient and Hessian required for the iteration in (12) can be generated using (21), (22), and (59).
2) Constrained-Based Approach Using a Hierarchical Binary Tree: Our preference is to avoid penalty-based methods and to derive the optimal weights as a constrained optimization problem. Specifically, we seek the maximum entropy weights
900 |
IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 17, NO. 6, JUNE 2008 |
Fig. 1. Binary tree structure for hierarchical density estimation.
The constraint in the maximum entropy problem is defined such that its corresponding ISE cost function does not exceed the optimal ISE cost beyond a prespecified value . From (6), (25), and (26), we can determine the optimal ISE coefficient by minimization of the cost given by
(27)
such that . It is easy to show that
such that its corresponding ISE cost function |
does |
||
not exceed the optimal ISE cost |
beyond a prespecified |
||
value . We thus define the maximum-entropy coefficients |
|
||
to be given by |
|
|
|
|
|
|
(23) |
such that |
. |
|
|
A closed-form solution to this problem is difficult to obtain in general. However, we can obtain the closed-form solution when the number of centers is limited to two. Hence, we form an iterative process, where we assume that we only have two centers at each iteration. We represent this iterative process as a hierarchical model, which generates new centers at each iteration. We use a binary tree to illustrate the hierarchical model, where each node in the tree depicts a single kernel. Therefore, in the binary tree, each parent node has two children nodes as seen in Fig. 1. The final density function corresponds to the kernels at the leafs of the tree. We now wish to determine the maximum entropy kernel density estimation at each iteration of the hierarchical binary tree. We, therefore, seek the maximum entropy coefficients. Note that sum of these coefficients is dictated by the corresponding coefficients of their parent node. This restriction will ensure that the sum of the coefficients of all the leave nodes (i.e., nodes with no children) is one since we set the coefficient of the root parent node to 1. We simplify the notation
by considering |
and |
to be the coefficients of the children |
|
nodes where |
is used to denote the coefficient of their corre- |
||
sponding parent node (i.e., |
). This implies that |
||
it is sufficient to characterize the optimal coefficient |
such that |
||
. |
|
|
|
The samples are divided into two groups using |
-means |
method at each node. Let us adopt the following notation:
(24)
(28)
Therefore, from (6), (25), (26), and (28), we have
|
|
|
(29) |
|
|
|
|
We assume, without loss of generality, that |
. Therefore, |
||
the constant |
|
is equivalent to |
|
|
. |
|
From (10) and (25), we observe that the maximum entropy
coefficient |
is given by |
|
(30) |
such that |
and |
.
Therefore, from (30), we form the Lagrangian given by
Differentiating with respect to and setting to zero, we have
(31)
We shall now determine the Lagrange multiplier by satisfying the constraint
(32)
From (31) and (32), we observe that
(33)
Therefore, from (33) and (31), we observe that
where |
. From (6) and (7), we observe that |
|
|
|
|
|
|
|
|
(34) |
||
|
|
|
|
|
|
|
|
|||||
|
|
|
(25) |
Finally, we impose the condition |
|
. Therefore, |
||||||
|
|
|
(26) |
from (34), we have |
|
|
||||||
where |
, |
, and |
. |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
HONG AND SCHONFELD: MAXIMUM-ENTROPY EXPECTATION-MAXIMIZATION ALGORITHM |
901 |
III. MAXIMUM-ENTROPY
EXPECTATION-MAXIMIZATION ALGORITHM
As seen in previous section, the ISE-based methods enable pdf estimation given a set of observations without information about the underlying density. However, the ISE based solutions do not fully utilize the sample information as the number of samples increases. Moreover, ISE-based methods are generally used to determine optimal weights used in the linear combination. Selection of the mean and variance of the kernel functions is accomplished by using the -means algorithm, which can be viewed as a hard limiting case of the EM [7]. The EM algorithm offers an approximation of the pdf by an iterative optimization under the maximum likelihood criterion.
A probability density function can be approximated as the sum of Gaussian functions
(35)
where is center of a Gaussian function, is a covariance matrix of function and is the weight for each center which subject to the conditions as (4). The Gaussian function is given by
The expectation step of the EM algorithm can be separated into two terms, one is the expectation related with likelihood and the other is the expectation related with the entropy penalty
(40)
(41)
where denotes that this expectation is from the likelihood function, denotes that this expectation is from the entropy penalty, and denotes the number of iteration.
The Jensen’s inequality is applied to find the new lower bound of the likelihood functions using (40) and (41). Therefore, the lower bound function for the likelihood function can be derived as
(36)
From (35) and (36), we observe that the logarithm of the likelihood function for the given Gaussian mixture parameters that has observations can be written as
(37)
where is the sample and is a set of parameters (i.e., the weights, centers, and covariances) to be estimated.
The entropy term is added in order to make the estimated density function smooth and not to have an impulse distribution. We expand Renyi’s quadratic entropy measure [11] to incorporate with covariance matrices and use the measure again. Substituting (35) into (8), expanding the square and interchanging the order of summation and integration, we obtain the following:
(38)
We, therefore, form an augmented likelihood function parameterized by a positive scalar in order to simultaneously maximize the entropy and likelihood using (37) and (38). The augmented likelihood function is given by
|
|
(42) |
Now, we wish to obtain a lower bound |
for the entropy |
|
|
. This bound cannot be derived using the method in (42) |
|
since |
is not a concave function. To derive the lower |
bound, we, therefore, rely on a monotonically decreasing and concave function such that . The detailed derivation is provided in Appendix C. Notice that maximization of the entropy remains unchanged if we replace the function in (38) by since both are monotonically decreasing functions. We can now use Jensens inequality to obtain the lower bound for the entropy
The lower bound |
which combines the two lower |
bounds is given by |
|
|
(43) |
|
Since we have the lower bound function, the new estimates of |
|
the parameters are easily calculated by setting the derivatives of |
(39) |
with respect to each parameters to zero. |
902 |
IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 17, NO. 6, JUNE 2008 |
A. Mean
The new estimates for the mean vectors can be obtained by the derivative of (43) with respect to and setting it to zero. Therefore
(44)
(47)
Using (47) and the symmetric property of Gaussian, we thus introduce a new lower bound for the covariance given by
B. Weight
For the weights, we once again use the soft-max function in (13) and (14). Thus, by setting the derivative of with respect to to zero, the new estimated weight is given by
(45)
C. Covariance
In order to update the EM algorithm, the derivative of (43) with respect to is required. However, the derivative cannot be solved directly because of the existence of the inverse matrix which appears in the derivative. We, therefore, introduce a new lower bound for the EM algorithm using Cauchy–Schwartz inequality. The lower bound given by (43) can be rewritten as
The term |
|
|
(46) |
|
|||
|
in (46) is equal to |
||
|
. Using the Cauchy–Schwartz |
inequality and the fact that the Gaussian function is greater than or equal to zero, we obtain
Therefore, the new estimated covariance |
is attained by |
||
setting the derivative of the new lower bound |
|
with |
|
respect to to zero |
|
|
|
|
|
|
(48) |
|
|
|
We note that the EM algorithm presented here relies on a simple extension of the lower-bound maximization method in [17]. In particular, we can use this method to prove that our algorithm converges to a local maximum on the bound generated by the Cauchy–Schwartz inequality, which serves as a lower bound on the augmented likelihood function. Moreover, we would have attained a local maximum of the augmented likelihood function had we not used the Cauchy–Schwartz inequality to obtain a lower bound for the sum of the covariances. Note that the Cauchy–Schwartz inequality is met with equality if and only if the covariance matrices of the different kernels are identical. Therefore, if the kernels are restricted to have the same covariance structure, the maximum-entropy expecta- tion-maximization algorithm converges to a local maximum of the augmented likelihood function.
IV. TWO-DIMENSIONAL DENSITY ESTIMATION
We apply MEEM method and other conventional methods to a 2-D density estimation problem. Fig. 2(a) describes original 2-D density function and Fig. 2(b) displays a scatter plot of 500 data samples drawn from (49) in the interval [0,1]. The equation used for generating the samples is given by
(49)
where . Given data without knowledge of the underlying density function used to generate the observations, we must estimate the 2-D density function. Here, we use 500, 1000, 1500, and 2000 samples for the experiment. With the exception of the RSDE method, the other approaches cannot be used to determine the optimal number of centers since it will
HONG AND SCHONFELD: MAXIMUM-ENTROPY EXPECTATION-MAXIMIZATION ALGORITHM |
903 |
Fig. 3. SNR improvements according to iteration and the parameter .
Fig. 2. Comparison of 2-D density estimation from 500 samples. (a) Original density function; (b) 500 samples; (c) RSDE; (d) HMEKDE; (e) Newton’s
method; (f) conventional EM; (g) MEEM.
TABLE I
SNR COMPARISON OF ALGORITHM FOR 2-D DENSITY ESTIMATION
fluctuate based on variations in the problem (e.g., initial conditions). We determine the number of centers experimentally such that we assign less than 100 samples per center for Newton’s method, EM and MEEM. For the HMEKDE method, we terminate the splitting of the hierarchical tree when the leaf has less than 5% of total number of samples.
The results of RSDE are shown in Fig. 2(c). RSDE method is very powerful algorithm in that it requires no parameters for the estimation. However, the choice for the kernel width is very crucial since it suffers the degeneracy problem when the kernel width is large and the reduction performance is diminished when the kernel width is small. The results of Newton’s method and HMEKDE are given in Fig. 2(d) and (e), respectively. The major practical issue in implementing Newton’s method is the guarantee of local minimum, which can be sustained by positive definitiveness of Hessian matrix [15]. Thus, we use the Levenberg–Marquardt algorithm [18], [19]. The value in HMEKDE method is chosen experimentally. The results of the conventional EM algorithm and the MEEM algorithm are shown in Fig. 2(f) and (g), respectively. The variable in MEEM algorithm is chosen experimentally. The result of MEEM is properly smoothed.
In Fig. 3, SNR improvements according to iteration and the value of is displayed using 300 samples. We choose the value as proportional to the number of samples. The parameter values multiplied by the number of samples, are shown in Fig. 3 (i.e., 0.05, 0.10, and 0.15). We observe the over-fitting problem of the EM algorithm in Fig. 3. The overall improvements in SNR are given in Table I.
V. IMAGE RECONSTRUCTION AND SENSOR FIELD ESTIMATION
Density estimation problem can easily expanded into practical problems like image reconstruction from random sample. For experiment, we use 256 256 gray Pepper, Lena, and Barbara images which is shown in Fig. 4(a)–(c).
We take 50% samples of Pepper image, 60% samples of Lena image and 70% of Barbara image. We use density function model in [20] where is the intensity value and
is the location of a pixel. We estimate a density function of given image from samples. For the reduction of computational
Fig. 4. Three 256 2 256 gray images used for the experiments. (a) Pepper,
(b) Lena, and (c) Barbara and two sensor fields used for sensor field estimation from randomly scattered sensor: (d) polynomial sensor field and (e) artificial sensor field.
burden, 50% overlapped 16 16 blocks are used for the experiment. Since the smoothness is different from block to block, we choose the smoothing parameter for each block experimentally. The initial center location is equally spaced. We use 3 3 centers for experiment. Using the estimated density function, we can estimate the intensity value of given location using expectation operation of conditional density distribution function. The sampled image and the reconstruction results of Lena are shown in Fig. 5.
We can also expand our approach into the estimation of sensor field from randomly scattered sensors. In this experiment, we generate an arbitrary field using polynomials in Fig. 4(d) and an artificial field in Fig.4(e). The original sensor field is randomly sampled and 2% of samples is used for the polynomial field and 30% of samples are used for the artificial field. We use density function model where L is intensity value and is the location of sensor. 50% overlapped 32 32 blocks and 16
16 blocks are used for the estimation of polynomial sensor field and artificial sensor field respectively for computational time. We also choose the smoothing parameter for each block experimentally. The initial center location is equally spaced. We use 3 3 centers for each experiment. We estimate a density function of given field using sensors. For each algorithm except HMEKDE, we use equally spaced centers for the initial location of center. The sampled sensor field and the estimation results of
904 |
IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 17, NO. 6, JUNE 2008 |
TABLE II
SNR COMPARISON OF DENSITY ESTIMATION ALGORITHM FOR IMAGE RECONSTRUCTION AND SENSOR FIELD ESTIMATION
VI. DISCUSSION
Fig. 5. Comparison of density estimation for image reconstruction from randomly sampled image. (a) 60% sampled image; (b) RSDE; (c) HMEKDE;
(d) Newton’s method; (e) conventional EM; (f) MEEM.
In this section, we discuss the relationship between the number of center and minimum/maximum entropy. Our experimental results indicate that, in most cases, the results under the maximum entropy show better results than the conventional EM algorithm. However, in some limited cases, like when we use a small number of centers, the results of minimum entropy penalty shows better results than the results of the conventional EM algorithm and maximum entropy penalty. This is due to the characteristics of maximum and minimum entropy, which is well described in [21]. The maximum entropy solution provides us smooth solution. In the case that the number of centers are relatively sufficient, each center can represent piecewise one Gaussian component, which means the resulting density function can be described better under maximum entropy criterion. On the contrary, the minimum entropy solution gives us the least smooth distribution. In the case that the number of centers are insufficient, each center should represent a large number of samples; thus, the resulting distribution described by a center should be the least smooth one, since each center cannot be described in terms of piecewise Gaussian any more. However, the larger number of centers used, the better the result.
Fig. 6. Comparison of density estimation for artificial sensor field estimation from randomly scattered sensor. (a) 30% sampled sensor; (b) RSDE;
(c) HMEKDE; (d) Newton’s method; (e) conventional EM; (f) MEEM.
artificial field are given in Fig. 6. The signal to noise ratio of the results and the computational time are also given in Table II.
VII. CONCLUSION
In this paper, we develop a new algorithm for density estimation using the EM algorithm with a ME constraint. The proposed MEEM algorithm provides a recursive method to compute a smooth estimate of the maximum likelihood estimate. The MEEM algorithm is particularly suitable for tasks that require the estimation of a smooth function from limited or partial data, such as image reconstruction and sensor field estimation. We demonstrated the superior performance of the proposed MEEM algorithm in comparison to various methods (including the traditional EM algorithm) in application to 2-D density estimation, image reconstruction from randomly sampled data, and sensor field estimation from scattered sensor networks.
HONG AND SCHONFELD: MAXIMUM-ENTROPY EXPECTATION-MAXIMIZATION ALGORITHM |
905 |
APPENDIX A |
|
|
impulse function. In particular, we assume that the elements in |
||||||
DEGENERACY OF THE KERNEL DENSITY ESTIMATION |
|
the vector |
have a unique maximum element |
with index |
|||||
This appendix illustrates the degeneracy of kernel density es- |
. This assumption generally corresponds to the case where |
||||||||
the true density function has a distinct maximum leading to a |
|||||||||
timation discussed in [6]. We will show that the ISE cost func- |
|||||||||
high density region in the data samples. We show that the op- |
|||||||||
tion converges asymptotically to the linear linear term |
as |
||||||||
timal distribution of the coefficients |
obtained from the so- |
||||||||
the number of data samples increases. Moreover, we show that |
|||||||||
lution of the linear programming problem in (50) is character- |
|||||||||
optimization of the linear term |
leads to a trivial solution |
||||||||
ized by a spike corresponding to the maximum element and zero |
|||||||||
where all of the coefficients are zero except one which is con- |
|||||||||
for all other coefficients. |
|
|
|
||||||
sistent with the observation in [2]. We will, therefore, establish |
|
|
|
||||||
Proposition 2: |
if and only if |
and |
|||||||
that the minimal ISE coefficients will converge to an impulse |
|||||||||
|
. |
|
|
|
|||||
coefficient distribution as the number of data samples increases. |
|
|
|
|
|||||
Proof: We observe that |
|
|
|
||||||
In the following proposition, we prove that the ISE cost function |
|
|
|
||||||
|
|
|
|
|
|||||
in (6) decays asymptotically to the linear linear term |
|
|
|
|
|
|
|||
as the number of data samples |
increases. |
|
|
|
|
|
|
(51) |
|
Proposition 1: |
as |
. |
|
|
|
|
|
|
|
Proof: The ratio of the quadratic and linear term in (6) is |
If we set |
and |
|
on the left side of (51), |
|||||
given by |
|
|
|
|
|||||
|
|
|
and apply the constraint |
on the right, the inequality is |
|||||
|
|
|
|
met as an equality. Or
We now prove the converse,
. Therefore
Expanding the sum, we obtain
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Canceling common terms and grouping terms with like coeffi- |
|||||||
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
cients, we observe that |
|
|
|
|
|
||
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
(52) |
||
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|||
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|||
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Since |
in (52), this implies |
, |
. |
|
|||
where we |
conclude that |
the quadratic |
term |
|
decays |
|
|||||||||||||||||||
|
|
||||||||||||||||||||||||
|
This result can be easily extended to the case where the |
||||||||||||||||||||||||
asymptotically at an exponential rate with increasing number |
|||||||||||||||||||||||||
elements in |
the vector |
have |
maximum |
element |
at |
||||||||||||||||||||
of data samples and the quadratic programming minimizing |
|||||||||||||||||||||||||
indexes |
where |
|
. This situation gener- |
||||||||||||||||||||||
problem in (6) reduces to a linear programming problem de- |
|
||||||||||||||||||||||||
ally arises when the true density function has several nearly |
|||||||||||||||||||||||||
fined by the linear term |
. |
|
|
|
|
|
|
|
|
||||||||||||||||
|
|
|
|
|
|
|
|
equal modes leading to a few high density regions in the data |
|||||||||||||||||
|
|
|
|
|
|
|
|
||||||||||||||||||
Therefore, we can now determine the minimal ISE coeffi- |
|||||||||||||||||||||||||
sample. In this case, we can show that |
, where |
||||||||||||||||||||||||
cients |
as the number of data samples |
increases from (6) |
|||||||||||||||||||||||
|
if and only if |
|
and |
when |
|||||||||||||||||||||
by minimization of the linear programming problem defined by |
|
|
|||||||||||||||||||||||
|
|
|
|
|
|
|
|||||||||||||||||||
; i.e., |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
We now observe that the minimal ISE coefficient distribution |
||||||||
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
||||||||
|
|
|
|
|
|
|
|
|
|
|
|
|
|
(50) |
decays asymptotically to a Kronecker delta function |
||||||||||
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
as |
the number |
of data |
samples |
increases |
(i.e., |
||
such that |
|
|
|
|
|
and |
when |
|
|
|
. |
|
|
|
, when |
|
and |
, when |
|||||||
In the following proposition, we show that the linear program- |
. |
|
|
|
|
|
|
||||||||||||||||||
ming problem corresponding to the minimal ISE cost function |
Corollary 1: |
as |
. |
|
|
|
|||||||||||||||||||
as the number of data samples |
increases degenerates to a |
Proof: The proof is obtained directly from Propositions 1 |
|||||||||||||||||||||||
trivial distribution of the coefficients |
which consists of an |
and 2. |
|
|
|
|
|
|
|||||||||||||||||
|
|
|
|
|
|
||||||||||||||||||||
|
|
|
|
|
|
906 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 17, NO. 6, JUNE 2008
This corollary implies that the minimal ISE kernel density es- |
gradient of (20) which requires the gradient of |
and |
. |
|||||||||||||
timation leads to the degenerative approximation |
which |
Thus, from (16) and (17), we can express the gradient of |
as |
|||||||||||||
consists of a single kernel and is given by |
|
|
|
|
|
|
|
|
|
|
|
|
||||
as the number of samples |
increases [see (3)]. |
(53) |
|
|
|
|
|
|
|
|
|
|
|
|||
|
|
|
|
|
|
|
|
|
|
|
||||||
|
|
|
|
|
|
|
|
|
|
|
|
|||||
We will now examine the entropy of the degenerative distribu- |
|
|
|
|
|
|
|
|
|
|
|
|||||
|
|
|
|
|
|
|
|
|
|
|
||||||
tion |
given by (53), which has the lowest entropy among |
|
|
|
|
|
|
|
|
|
|
|
||||
all possible kernel density estimates. |
|
|
|
|
|
|
|
|
|
|
|
|
||||
Proposition 3: |
. |
|
|
|
|
|
|
|
|
|
|
|
|
|||
|
|
|
|
|
|
|
|
|
|
|
|
|||||
|
Proof: We observe that |
|
, for all |
and . |
|
|
|
|
|
|
|
|
|
|
|
|
Therefore, it follows that |
|
|
|
|
|
|
|
|
|
|
|
|
|
(57) |
||
|
|
|
|
|
|
Similarly, from (18) and (19), the gradient of |
can be ex- |
|||||||||
|
|
|
|
|
|
pressed as |
|
|
Therefore, we have
(54)
Taking logarithms on both sides and multiplying by 1, we
obtain |
|
|
|
|
|
|
|
|
|
(55) |
|
|
|
|
|
||
We now compute the entropy |
of the degenerative |
||||
distribution |
. From (2), (9), and (53), we obtain |
||||
We now add |
|
|
|
(56) |
|
|
|
|
|||
to both sides of (55) and using (9) and |
(59), we observe that
This completes the proofs. From the proposition above, we observe that the ISE-based kernel density estimation yields the lowest entropy kernel density estimation. It results in a kernel density estimate that consists of a single kernel. This result presents a clear indication of
the limitation of ISE-based cost functions.
APPENDIX B
GRADIENT AND HESSIAN IN NEWTON’S METHOD
In this appendix, we provide the detailed derivation of the gradient and the Hessian matrix of (12). First, we present the
(58)
Thus, the element of the gradient can be expressed as
(59)
The element of the Hessian matrix can be expressed as
a)
(60)
b)
(61)