
CHAPTER 17. PANEL DATA

and

$$d_i = \begin{pmatrix} d_{i1} \\ \vdots \\ d_{in} \end{pmatrix},$$

an $n \times 1$ dummy vector with a "1" in the $i$'th place. Let

 

 

 

$$u = \begin{pmatrix} u_1 \\ \vdots \\ u_n \end{pmatrix}.$$

Then note that

$$u_i = d_i' u,$$

and

$$y_{it} = x_{it}'\beta + d_i' u + e_{it}. \tag{17.2}$$

Observe that

$$E(e_{it} \mid x_{it}, d_i) = 0,$$

so (17.2) is a valid regression, with $d_i$ as a regressor along with $x_{it}$.

 

OLS on (17.2) yields the estimators $(\hat{\beta}, \hat{u})$. Conventional inference applies.

 

Observe that:

• This is generally consistent.
• If $x_{it}$ contains an intercept, it will be collinear with $d_i$, so the intercept is typically omitted from $x_{it}$.
• Any regressor in $x_{it}$ which is constant over time for all individuals (e.g., their gender) will be collinear with $d_i$, so will have to be omitted.
• There are $n + k$ regression parameters, which is quite large as typically $n$ is very large.

Computationally, you do not want to actually implement conventional OLS estimation, as the parameter space is too large. OLS estimation of $\beta$ instead proceeds by the FWL theorem. Stacking the observations together:

 

 

$$y = X\beta + Du + e,$$

 

then by the FWL theorem,

 

 

 

 

 

 

 

 

 

 

 

$$\hat{\beta} = \left(X'(I - P_D)X\right)^{-1}\left(X'(I - P_D)y\right) = \left(\tilde{X}'\tilde{X}\right)^{-1}\left(\tilde{X}'\tilde{y}\right),$$

where

$$\tilde{y} = y - D(D'D)^{-1}D'y$$
$$\tilde{X} = X - D(D'D)^{-1}D'X.$$

 

 

 

 

 

 

 

 

 

 

 

 

Since the regression of $y_{it}$ on $d_i$ is a regression onto individual-specific dummies, the predicted value from these regressions is the individual-specific mean $\bar{y}_i$, and the residual is the demeaned value

$$\tilde{y}_{it} = y_{it} - \bar{y}_i.$$

The fixed effects estimator $\hat{\beta}$ is OLS of $\tilde{y}_{it}$ on $\tilde{x}_{it}$, the dependent variable and regressors in deviation-from-mean form.
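To make the computation concrete, here is a minimal numpy sketch of the within (demeaning) transformation and the resulting fixed effects estimate. The function name, the stacked-array layout with an `ids` vector, and the simulated data are illustrative assumptions, not part of the text.

```python
import numpy as np

def fixed_effects(y, X, ids):
    """Within (fixed effects) estimator: demean y and X by individual,
    then run OLS on the demeaned data (the FWL result above)."""
    y_t, X_t = y.astype(float).copy(), X.astype(float).copy()
    for i in np.unique(ids):
        m = ids == i
        y_t[m] -= y_t[m].mean()           # y_it - ybar_i
        X_t[m] -= X_t[m].mean(axis=0)     # x_it - xbar_i
    beta, *_ = np.linalg.lstsq(X_t, y_t, rcond=None)
    return beta

# tiny simulated panel with one regressor correlated with the individual effect
rng = np.random.default_rng(0)
n, T = 100, 5
ids = np.repeat(np.arange(n), T)
u = np.repeat(rng.normal(size=n), T)              # individual effects u_i
x = rng.normal(size=(n * T, 1)) + 0.5 * u[:, None]
y = x @ np.array([2.0]) + u + rng.normal(size=n * T)
print(fixed_effects(y, x, ids))                   # close to the true value 2.0
```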


Another derivation of the estimator is to take the equation

$$y_{it} = x_{it}'\beta + u_i + e_{it},$$

and then take individual-specific means by taking the average for the $i$'th individual:

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

$$\frac{1}{T_i}\sum_{t=1}^{T_i} y_{it} = \frac{1}{T_i}\sum_{t=1}^{T_i} x_{it}'\beta + u_i + \frac{1}{T_i}\sum_{t=1}^{T_i} e_{it}$$

or

$$\bar{y}_i = \bar{x}_i'\beta + u_i + \bar{e}_i.$$

Subtracting, we find

$$\tilde{y}_{it} = \tilde{x}_{it}'\beta + \tilde{e}_{it},$$

which is free of the individual effect $u_i$.

17.3 Dynamic Panel Regression

A dynamic panel regression has a lagged dependent variable

$$y_{it} = \alpha y_{it-1} + x_{it}'\beta + u_i + e_{it}. \tag{17.3}$$

This is a model suitable for studying dynamic behavior of individual agents.

Unfortunately, the fixed effects estimator is inconsistent, at least if $T$ is held finite as $n \to \infty$. This is because the sample mean of $y_{it-1}$ is correlated with that of $e_{it}$.

The standard approach to estimate a dynamic panel is to combine first-differencing with IV or GMM. Taking first-differences of (17.3) eliminates the individual-specific effect:

$$\Delta y_{it} = \alpha \Delta y_{it-1} + \Delta x_{it}'\beta + \Delta e_{it}. \tag{17.4}$$

However, if $e_{it}$ is iid, then $\Delta e_{it}$ will be correlated with $\Delta y_{it-1}$:

$$E(\Delta y_{it-1} \Delta e_{it}) = E\left((y_{it-1} - y_{it-2})(e_{it} - e_{it-1})\right) = -E(y_{it-1}e_{it-1}) = -\sigma_e^2.$$

So OLS on (17.4) will be inconsistent.

But if there are valid instruments, then IV or GMM can be used to estimate the equation. Typically, we use lags of the dependent variable, two periods back, as $y_{t-2}$ is uncorrelated with $\Delta e_{it}$. Thus values of $y_{it-k}$, $k \geq 2$, are valid instruments.

Hence a valid estimator of $\alpha$ and $\beta$ is to estimate (17.4) by IV using $y_{t-2}$ as an instrument for $\Delta y_{t-1}$ (which is just identified). Alternatively, GMM using $y_{t-2}$ and $y_{t-3}$ as instruments (which is overidentified, but loses a time-series observation).

A more sophisticated GMM estimator recognizes that for time periods later in the sample, there are more instruments available, so the instrument list should be different for each equation. This is conveniently organized by the GMM principle, as this enables the moments from the different time periods to be stacked together to create a list of all the moment conditions. A simple application of GMM yields the parameter estimates and standard errors.
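As an illustration of the just-identified IV estimator described above (a balanced panel with no $x_{it}$), here is a small numpy sketch; the function name and the simulated data are illustrative assumptions, and a serious application would use the stacked-instrument GMM just described.

```python
import numpy as np

def first_diff_iv(y):
    """IV estimate of alpha in Dy_it = alpha*Dy_it-1 + De_it,
    using y_it-2 as the single instrument (just identified, no x_it)."""
    dy = y[:, 1:] - y[:, :-1]       # first differences; column j is Dy at time j+1
    dep = dy[:, 1:].ravel()         # Dy_t    for t = 2,...,T-1
    lag = dy[:, :-1].ravel()        # Dy_t-1
    z = y[:, :-2].ravel()           # y_t-2, uncorrelated with De_t
    return (z @ dep) / (z @ lag)    # (z'x)^{-1}(z'y) with a single instrument

# simulate y_it = alpha*y_it-1 + u_i + e_it on a balanced n x T panel
rng = np.random.default_rng(1)
n, T, alpha = 500, 8, 0.5
u = rng.normal(size=n)
y = np.zeros((n, T))
for t in range(1, T):
    y[:, t] = alpha * y[:, t - 1] + u + rng.normal(size=n)
print(first_diff_iv(y))             # roughly 0.5 (IV is noisy in short panels)
```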

Chapter 18

Nonparametrics

18.1 Kernel Density Estimation

Let $X$ be a random variable with continuous distribution $F(x)$ and density $f(x) = \frac{d}{dx}F(x)$. The goal is to estimate $f(x)$ from a random sample $(X_1, \ldots, X_n)$. While $F(x)$ can be estimated by the EDF $\hat{F}(x) = n^{-1}\sum_{i=1}^{n} 1(X_i \leq x)$, we cannot define $\frac{d}{dx}\hat{F}(x)$ since $\hat{F}(x)$ is a step function. The standard nonparametric method to estimate $f(x)$ is based on smoothing using a kernel.

While we are typically interested in estimating the entire function $f(x)$, we can simply focus on the problem where $x$ is a specific fixed number, and then see how the method generalizes to estimating the entire function.

Definition 18.1.1 $K(u)$ is a second-order kernel function if it is a symmetric zero-mean density function.

Three common choices for kernels include the Normal

$$K(u) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{u^2}{2}\right),$$

the Epanechnikov

$$K(u) = \begin{cases} \frac{3}{4}\left(1 - u^2\right), & |u| \leq 1 \\ 0, & |u| > 1 \end{cases}$$

and the Biweight or Quartic

$$K(u) = \begin{cases} \frac{15}{16}\left(1 - u^2\right)^2, & |u| \leq 1 \\ 0, & |u| > 1 \end{cases}$$

In practice, the choice between these three rarely makes a meaningful difference in the estimates.

The kernel functions are used to smooth the data. The amount of smoothing is controlled by the bandwidth $h > 0$. Let

$$K_h(u) = \frac{1}{h} K\left(\frac{u}{h}\right)$$

be the kernel $K$ rescaled by the bandwidth $h$. The kernel density estimator of $f(x)$ is

$$\hat{f}(x) = \frac{1}{n}\sum_{i=1}^{n} K_h(X_i - x).$$

This estimator is the average of a set of weights. If a large number of the observations $X_i$ are near $x$, then the weights are relatively large and $\hat{f}(x)$ is larger. Conversely, if only a few $X_i$ are near $x$, then the weights are small and $\hat{f}(x)$ is small. The bandwidth $h$ controls the meaning of "near".
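A minimal numpy sketch of this estimator with the Normal kernel may help fix ideas; the grid, sample size, and bandwidth below are arbitrary illustration choices, not recommendations from the text.

```python
import numpy as np

def gaussian_kernel(u):
    """Normal (Gaussian) second-order kernel."""
    return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

def kde(x_grid, data, h):
    """Kernel density estimate f_hat(x) = (1/n) * sum_i K_h(X_i - x)."""
    n = data.shape[0]
    u = (data[None, :] - x_grid[:, None]) / h      # (grid, n) rescaled arguments
    return gaussian_kernel(u).sum(axis=1) / (n * h)

rng = np.random.default_rng(0)
X = rng.normal(size=200)                # sample from N(0,1)
grid = np.linspace(-3, 3, 61)
f_hat = kde(grid, X, h=0.4)
print(f_hat[30])                        # estimate of f(0); true value is about 0.399
```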

Interestingly, $\hat{f}(x)$ is a valid density. That is, $\hat{f}(x) \geq 0$ for all $x$, and

 

 

$$\int_{-\infty}^{\infty} \hat{f}(x)\,dx = \int_{-\infty}^{\infty} \frac{1}{n}\sum_{i=1}^{n} K_h(X_i - x)\,dx = \frac{1}{n}\sum_{i=1}^{n} \int_{-\infty}^{\infty} K_h(X_i - x)\,dx = \frac{1}{n}\sum_{i=1}^{n} \int_{-\infty}^{\infty} K(u)\,du = 1$$

where the second-to-last equality makes the change-of-variables $u = (X_i - x)/h$.

 

We can also calculate the moments of the density $\hat{f}(x)$. The mean is

$$
\begin{aligned}
\int_{-\infty}^{\infty} x \hat{f}(x)\,dx
&= \frac{1}{n}\sum_{i=1}^{n} \int_{-\infty}^{\infty} x K_h(X_i - x)\,dx \\
&= \frac{1}{n}\sum_{i=1}^{n} \int_{-\infty}^{\infty} (X_i + uh) K(u)\,du \\
&= \frac{1}{n}\sum_{i=1}^{n} X_i \int_{-\infty}^{\infty} K(u)\,du + \frac{1}{n}\sum_{i=1}^{n} h \int_{-\infty}^{\infty} u K(u)\,du \\
&= \frac{1}{n}\sum_{i=1}^{n} X_i,
\end{aligned}
$$

the sample mean of the $X_i$, where the second-to-last equality used the change-of-variables $u = (X_i - x)/h$ which has Jacobian $h$.

The second moment of the estimated density is

$$
\begin{aligned}
\int_{-\infty}^{\infty} x^2 \hat{f}(x)\,dx
&= \frac{1}{n}\sum_{i=1}^{n} \int_{-\infty}^{\infty} x^2 K_h(X_i - x)\,dx \\
&= \frac{1}{n}\sum_{i=1}^{n} \int_{-\infty}^{\infty} (X_i + uh)^2 K(u)\,du \\
&= \frac{1}{n}\sum_{i=1}^{n} X_i^2 + \frac{2}{n}\sum_{i=1}^{n} X_i h \int_{-\infty}^{\infty} u K(u)\,du + \frac{1}{n}\sum_{i=1}^{n} h^2 \int_{-\infty}^{\infty} u^2 K(u)\,du \\
&= \frac{1}{n}\sum_{i=1}^{n} X_i^2 + h^2 \sigma_K^2
\end{aligned}
$$

where

$$\sigma_K^2 = \int_{-\infty}^{\infty} u^2 K(u)\,du$$

 

 

 

 

 

is the variance of the kernel. It follows that the variance of the density $\hat{f}(x)$ is

 

 

 

$$\int_{-\infty}^{\infty} x^2 \hat{f}(x)\,dx - \left(\int_{-\infty}^{\infty} x \hat{f}(x)\,dx\right)^2 = \frac{1}{n}\sum_{i=1}^{n} X_i^2 + h^2 \sigma_K^2 - \left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)^2 = \hat{\sigma}^2 + h^2 \sigma_K^2.$$

Thus the variance of the estimated density is inflated by the factor $h^2 \sigma_K^2$ relative to the sample moment.


18.2 Asymptotic MSE for Kernel Estimates

For fixed $x$ and bandwidth $h$ observe that

$$EK_h(X - x) = \int_{-\infty}^{\infty} K_h(z - x) f(z)\,dz = \int_{-\infty}^{\infty} K_h(uh) f(x + hu) h\,du = \int_{-\infty}^{\infty} K(u) f(x + hu)\,du.$$

The second equality uses the change-of-variables $u = (z - x)/h$. The last expression shows that the expected value is an average of $f(z)$ locally about $x$.

This integral (typically) is not analytically solvable, so we approximate it using a second-order Taylor expansion of $f(x + hu)$ in the argument $hu$ about $hu = 0$, which is valid as $h \to 0$. Thus

 

 

$$f(x + hu) \simeq f(x) + f'(x) hu + \frac{1}{2} f''(x) h^2 u^2$$

 

and therefore

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

$$
\begin{aligned}
EK_h(X - x) &\simeq \int_{-\infty}^{\infty} K(u) \left[ f(x) + f'(x) hu + \frac{1}{2} f''(x) h^2 u^2 \right] du \\
&= f(x) \int_{-\infty}^{\infty} K(u)\,du + f'(x) h \int_{-\infty}^{\infty} K(u) u\,du + \frac{1}{2} f''(x) h^2 \int_{-\infty}^{\infty} K(u) u^2\,du \\
&= f(x) + \frac{1}{2} f''(x) h^2 \sigma_K^2.
\end{aligned}
$$

The bias of $\hat{f}(x)$ is then

$$\mathrm{Bias}(x) = E\hat{f}(x) - f(x) = \frac{1}{n}\sum_{i=1}^{n} EK_h(X_i - x) - f(x) = \frac{1}{2} f''(x) h^2 \sigma_K^2.$$

We see that the bias of $\hat{f}(x)$ at $x$ depends on the second derivative $f''(x)$. The sharper the derivative, the greater the bias. Intuitively, the estimator $\hat{f}(x)$ smooths data local to $X_i = x$, so it is estimating a smoothed version of $f(x)$. The bias results from this smoothing, and is larger the greater the curvature in $f(x)$.

We now examine the variance of $\hat{f}(x)$. Since it is an average of iid random variables, using first-order Taylor approximations and the fact that $n^{-1}$ is of smaller order than $(nh)^{-1}$,

$$
\begin{aligned}
\mathrm{var}(x) &= \frac{1}{n} \mathrm{var}\left(K_h(X_i - x)\right) \\
&= \frac{1}{n} EK_h(X_i - x)^2 - \frac{1}{n} \left(EK_h(X_i - x)\right)^2 \\
&\simeq \frac{1}{n h^2} \int_{-\infty}^{\infty} K\left(\frac{z - x}{h}\right)^2 f(z)\,dz - \frac{1}{n} f(x)^2 \\
&= \frac{1}{n h} \int_{-\infty}^{\infty} K(u)^2 f(x + hu)\,du - \frac{1}{n} f(x)^2 \\
&\simeq \frac{f(x)}{n h} \int_{-\infty}^{\infty} K(u)^2\,du \\
&= \frac{f(x) R(K)}{n h},
\end{aligned}
$$

where $R(K) = \int_{-\infty}^{\infty} K(u)^2\,du$ is called the roughness of $K$.

Together, the asymptotic mean-squared error (AMSE) for fixed $x$ is the sum of the approximate squared bias and approximate variance:

$$AMSE_h(x) = \frac{1}{4} f''(x)^2 h^4 \sigma_K^4 + \frac{f(x) R(K)}{n h}.$$


A global measure of precision is the asymptotic mean integrated squared error (AMISE)

$$AMISE_h = \int AMSE_h(x)\,dx = \frac{h^4 \sigma_K^4 R(f'')}{4} + \frac{R(K)}{n h}, \tag{18.1}$$

where $R(f'') = \int (f''(x))^2\,dx$ is the roughness of $f''$. Notice that the first term (the squared bias) is increasing in $h$ and the second term (the variance) is decreasing in $nh$. Thus for the AMISE to decline with $n$, we need $h \to 0$ but $nh \to \infty$. That is, $h$ must tend to zero, but at a slower rate than $n^{-1}$.

Equation (18.1) is an asymptotic approximation to the MSE. We define the asymptotically optimal bandwidth $h_0$ as the value which minimizes this approximate MSE. That is,

$$h_0 = \underset{h}{\mathrm{argmin}}\; AMISE_h.$$

It can be found by solving the first-order condition

 

$$\frac{d}{dh} AMISE_h = h^3 \sigma_K^4 R(f'') - \frac{R(K)}{n h^2} = 0,$$

yielding

$$h_0 = \left(\frac{R(K)}{\sigma_K^4 R(f'')}\right)^{1/5} n^{-1/5}. \tag{18.2}$$

This solution takes the form $h_0 = c n^{-1/5}$ where $c$ is a function of $K$ and $f$, but not of $n$. We thus say that the optimal bandwidth is of order $O(n^{-1/5})$. Note that this $h$ declines to zero, but at a very slow rate.

In practice, how should the bandwidth be selected? This is a difficult problem, and there is a large and continuing literature on the subject. The asymptotically optimal choice given in (18.2) depends on $R(K)$, $\sigma_K^2$, and $R(f'')$. The first two are determined by the kernel function. Their values for the three functions introduced in the previous section are given here.

Kernel          $\sigma_K^2 = \int_{-\infty}^{\infty} u^2 K(u)\,du$     $R(K) = \int_{-\infty}^{\infty} K(u)^2\,du$
Gaussian        $1$          $1/(2\sqrt{\pi})$
Epanechnikov    $1/5$        $3/5$
Biweight        $1/7$        $5/7$

An obvious difficulty is that $R(f'')$ is unknown. A classic simple solution proposed by Silverman (1986) has come to be known as the reference bandwidth or Silverman's Rule-of-Thumb. It uses formula (18.2) but replaces $R(f'')$ with $\hat{\sigma}^{-5} R(\phi'')$, where $\phi$ is the $N(0,1)$ density and $\hat{\sigma}^2$ is an estimate of $\sigma^2 = \mathrm{var}(X)$. This choice for $h$ gives an optimal rule when $f(x)$ is normal, and gives a nearly optimal rule when $f(x)$ is close to normal. The downside is that if the density is very far from normal, the rule-of-thumb $h$ can be quite inefficient. We can calculate that $R(\phi'') = 3/(8\sqrt{\pi})$. Together with the above table, we find the reference rules for the three kernel functions introduced earlier.

Gaussian Kernel: $h_{rule} = 1.06\, \hat{\sigma}\, n^{-1/5}$
Epanechnikov Kernel: $h_{rule} = 2.34\, \hat{\sigma}\, n^{-1/5}$
Biweight (Quartic) Kernel: $h_{rule} = 2.78\, \hat{\sigma}\, n^{-1/5}$
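A small numpy sketch applying these reference rules; the function and variable names are illustrative assumptions, and here $\hat{\sigma}$ is simply the sample standard deviation.

```python
import numpy as np

def rule_of_thumb_bandwidth(X, kernel="gaussian"):
    """Reference bandwidth h_rule = c * sigma_hat * n^(-1/5),
    with the constants listed above for each kernel."""
    c = {"gaussian": 1.06, "epanechnikov": 2.34, "biweight": 2.78}[kernel]
    return c * X.std(ddof=1) * len(X) ** (-1 / 5)

rng = np.random.default_rng(0)
X = rng.normal(size=500)
print(rule_of_thumb_bandwidth(X))                   # about 1.06 * 500^(-1/5) ~ 0.3
print(rule_of_thumb_bandwidth(X, "epanechnikov"))   # about 2.34 * 500^(-1/5) ~ 0.7
```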

Unless you delve more deeply into kernel estimation methods, the rule-of-thumb bandwidth is a good practical bandwidth choice, perhaps adjusted by visual inspection of the resulting estimate $\hat{f}(x)$. There are other approaches, but implementation can be delicate. I now discuss some of these choices. The plug-in approach is to estimate $R(f'')$ in a first step, and then plug this estimate into the formula (18.2). This is more treacherous than may first appear, as the optimal $h$ for estimation of the roughness $R(f'')$ is quite different than the optimal $h$ for estimation of $f(x)$. However, there


are modern versions of this estimator which work well, in particular the iterative method of Sheather and Jones (1991). Another popular choice for selection of $h$ is cross-validation. This works by constructing an estimate of the MISE using leave-one-out estimators. There are some desirable properties of cross-validation bandwidths, but they are also known to converge very slowly to the optimal values. They are also quite ill-behaved when the data has some discretization (as is common in economics), in which case the cross-validation rule can sometimes select very small bandwidths, leading to dramatically undersmoothed estimates. Fortunately there are remedies; one is known as smoothed cross-validation, which is a close cousin of the bootstrap.

Appendix A

Matrix Algebra

A.1 Notation

A scalar a is a single number.

A vector $a$ is a $k \times 1$ list of numbers, typically arranged in a column. We write this as

$$a = \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_k \end{pmatrix}.$$

Equivalently, a vector $a$ is an element of Euclidean $k$ space, written as $a \in \mathbb{R}^k$. If $k = 1$ then $a$ is a scalar.

A matrix $A$ is a $k \times r$ rectangular array of numbers, written as

$$A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1r} \\ a_{21} & a_{22} & \cdots & a_{2r} \\ \vdots & \vdots & & \vdots \\ a_{k1} & a_{k2} & \cdots & a_{kr} \end{bmatrix}.$$

By convention $a_{ij}$ refers to the element in the $i$'th row and $j$'th column of $A$. If $r = 1$ then $A$ is a column vector. If $k = 1$ then $A$ is a row vector. If $r = k = 1$, then $A$ is a scalar.

A standard convention (which we will follow in this text whenever possible) is to denote scalars by lower-case italics $(a)$, vectors by lower-case bold italics $(\boldsymbol{a})$, and matrices by upper-case bold italics $(\boldsymbol{A})$. Sometimes a matrix $A$ is denoted by the symbol $(a_{ij})$.

A matrix can be written as a set of column vectors or as a set of row vectors. That is,

 

 

 

 

 

 

 

$$A = \begin{bmatrix} a_1 & a_2 & \cdots & a_r \end{bmatrix} = \begin{bmatrix} \alpha_1' \\ \alpha_2' \\ \vdots \\ \alpha_k' \end{bmatrix}$$

where

$$a_i = \begin{bmatrix} a_{1i} \\ a_{2i} \\ \vdots \\ a_{ki} \end{bmatrix}$$

are column vectors and

$$\alpha_j' = \begin{bmatrix} a_{j1} & a_{j2} & \cdots & a_{jr} \end{bmatrix}$$

are row vectors.

The transpose of a matrix, denoted $A'$, is obtained by flipping the matrix on its diagonal.

Thus

$$A' = \begin{bmatrix} a_{11} & a_{21} & \cdots & a_{k1} \\ a_{12} & a_{22} & \cdots & a_{k2} \\ \vdots & \vdots & & \vdots \\ a_{1r} & a_{2r} & \cdots & a_{kr} \end{bmatrix}.$$

Alternatively, letting $B = A'$, then $b_{ij} = a_{ji}$. Note that if $A$ is $k \times r$, then $A'$ is $r \times k$. If $a$ is a $k \times 1$ vector, then $a'$ is a $1 \times k$ row vector. An alternative notation for the transpose of $A$ is $A^{\top}$. A matrix is square if $k = r$. A square matrix is symmetric if $A = A'$, which requires $a_{ij} = a_{ji}$.

A square matrix is diagonal if the off-diagonal elements are all zero, so that $a_{ij} = 0$ if $i \neq j$. A square matrix is upper (lower) diagonal if all elements below (above) the diagonal equal zero.

An important diagonal matrix is the identity matrix, which has ones on the diagonal. The $k \times k$ identity matrix is denoted as

$$I_k = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix}.$$

A partitioned matrix takes the form

$$A = \begin{bmatrix} A_{11} & A_{12} & \cdots & A_{1r} \\ A_{21} & A_{22} & \cdots & A_{2r} \\ \vdots & \vdots & & \vdots \\ A_{k1} & A_{k2} & \cdots & A_{kr} \end{bmatrix}$$

where the $A_{ij}$ denote matrices, vectors and/or scalars.

A.2 Matrix Addition

If the matrices $A = (a_{ij})$ and $B = (b_{ij})$ are of the same order, we define the sum

$$A + B = (a_{ij} + b_{ij}).$$

Matrix addition follows the commutative and associative laws:

$$A + B = B + A$$
$$A + (B + C) = (A + B) + C.$$

A.3 Matrix Multiplication

If $A$ is $k \times r$ and $c$ is real, we define their product as

$$Ac = cA = (a_{ij} c).$$

If $a$ and $b$ are both $k \times 1$, then their inner product is

$$a'b = a_1 b_1 + a_2 b_2 + \cdots + a_k b_k = \sum_{j=1}^{k} a_j b_j.$$

Note that $a'b = b'a$. We say that two vectors $a$ and $b$ are orthogonal if $a'b = 0$.


If $A$ is $k \times r$ and $B$ is $r \times s$, so that the number of columns of $A$ equals the number of rows of $B$, we say that $A$ and $B$ are conformable. In this event the matrix product $AB$ is defined. Writing $A$ as a set of row vectors and $B$ as a set of column vectors (each of length $r$), the matrix product is defined as

$$AB = \begin{bmatrix} a_1' \\ a_2' \\ \vdots \\ a_k' \end{bmatrix} \begin{bmatrix} b_1 & b_2 & \cdots & b_s \end{bmatrix} = \begin{bmatrix} a_1' b_1 & a_1' b_2 & \cdots & a_1' b_s \\ a_2' b_1 & a_2' b_2 & \cdots & a_2' b_s \\ \vdots & \vdots & & \vdots \\ a_k' b_1 & a_k' b_2 & \cdots & a_k' b_s \end{bmatrix}.$$

Matrix multiplication is not commutative: in general $AB \neq BA$. However, it is associative and distributive:

$$A(BC) = (AB)C$$
$$A(B + C) = AB + AC.$$
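A quick numeric check of these three properties with numpy; the particular matrices are arbitrary illustration choices.

```python
import numpy as np

A = np.array([[1., 2.], [3., 4.]])
B = np.array([[0., 1.], [1., 0.]])
C = np.array([[2., 0.], [0., 3.]])

print(np.array_equal(A @ B, B @ A))               # False: AB != BA in general
print(np.allclose(A @ (B @ C), (A @ B) @ C))      # True: associative
print(np.allclose(A @ (B + C), A @ B + A @ C))    # True: distributive
```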

An alternative way to write the matrix product is to use matrix partitions. For example,

 

 

$$AB = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} \begin{bmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{bmatrix} = \begin{bmatrix} A_{11}B_{11} + A_{12}B_{21} & A_{11}B_{12} + A_{12}B_{22} \\ A_{21}B_{11} + A_{22}B_{21} & A_{21}B_{12} + A_{22}B_{22} \end{bmatrix}.$$

As another example,

 

 

 

 

 

$$AB = \begin{bmatrix} A_1 & A_2 & \cdots & A_r \end{bmatrix} \begin{bmatrix} B_1 \\ B_2 \\ \vdots \\ B_r \end{bmatrix} = A_1 B_1 + A_2 B_2 + \cdots + A_r B_r = \sum_{j=1}^{r} A_j B_j.$$

An important property of the identity matrix is that if $A$ is $k \times r$, then $AI_r = A$ and $I_k A = A$. The $k \times r$ matrix $A$, $r \leq k$, is called orthogonal if $A'A = I_r$.

A.4 Trace

The trace of a $k \times k$ square matrix $A$ is the sum of its diagonal elements

$$\mathrm{tr}(A) = \sum_{i=1}^{k} a_{ii}.$$

Some straightforward properties for square matrices A and B and real c are

$$\mathrm{tr}(cA) = c\, \mathrm{tr}(A)$$
$$\mathrm{tr}(A') = \mathrm{tr}(A)$$
$$\mathrm{tr}(A + B) = \mathrm{tr}(A) + \mathrm{tr}(B)$$
$$\mathrm{tr}(I_k) = k.$$
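These properties are easy to verify numerically; a small numpy check with arbitrary matrices:

```python
import numpy as np

A = np.array([[1., 2.], [3., 4.]])
B = np.array([[0., 5.], [6., 7.]])
c = 3.0

print(np.trace(c * A), c * np.trace(A))             # tr(cA) = c*tr(A)
print(np.trace(A.T), np.trace(A))                   # tr(A') = tr(A)
print(np.trace(A + B), np.trace(A) + np.trace(B))   # tr(A+B) = tr(A) + tr(B)
print(np.trace(np.eye(4)))                          # tr(I_k) = k -> 4.0
```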
