A Brief Discussion: 2SLS (Two-Stage Least Squares)

(2015-12-30 17:20:10)

Category: 04 STATA Data Processing

Note: This model could also be fit with sem, using maximum likelihood instead of a two-step method.
You can find examples for recursive models fit with sem in the “Structural models 2: Dependencies between endogenous variables” section of [SEM] intro 5 — Tour of models.

Someone posed the following question:

I am estimating an equation:

        Y = a + bX + cZ + dW

I then want to instrument W with Q. I know the first-stage regression is supposed to be

        W = e + fX + gZ + hQ

(i.e., use all the exogenous variables in the first stage). Actually this is automatically done if I use the ivregress command. However, I only want to use Q to instrument W without using X and Z in the first stage. Is there a way I can do it in Stata? I can regress W on Q and get the predicted W, and then use it in the second-stage regression. The standard errors will, however, be incorrect.

ivregress will not let you do this and, moreover, if you believe W to be endogenous because it is part of a system, then you must include X and Z as instruments, or you will get biased estimates for b, c, and d.

Consider the system

        Y1 = a0 + a1*Y2 + a2*X1 + a3*X2 + e1               (1)

        Y2 = b0 + b1*Y1 + b2*X3 + b3*X4 + e2               (2)

Warning: Assume we are estimating structural equation (1); if X1 and X2 are exogenous, then they must be kept as instruments or your estimates will be biased. In a general system, such exogenous variables must be used as instruments for any endogenous variables when the instrumented value for the endogenous variables appears in an equation in which the exogenous variable also appears.

Consider the reduced forms of your two equations:

        Y1 = e0 + e1*X1 + e2*X2 + e3*X3 + e4*X4 + u1        (1r)

        Y2 = f0 + f1*X1 + f2*X2 + f3*X3 + f4*X4 + u2        (2r)

where e# and f# are combinations of the a# and b# coefficients from (1) and (2) and u1 and u2 are linear combinations of e1 and e2.
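
To make the link between structural and reduced-form coefficients concrete, here is a sketch of the algebra behind (2r), obtained by substituting (1) into (2). It assumes a1*b1 is not equal to 1, so that the system can be solved; the symbol D is shorthand introduced here and does not appear in the original FAQ. The e_1 and e_2 in the block are the disturbances of (1) and (2).

```latex
\begin{aligned}
Y_2 &= b_0 + b_1\,(a_0 + a_1 Y_2 + a_2 X_1 + a_3 X_2 + e_1) + b_2 X_3 + b_3 X_4 + e_2 \\
(1 - a_1 b_1)\,Y_2 &= (b_0 + a_0 b_1) + a_2 b_1 X_1 + a_3 b_1 X_2 + b_2 X_3 + b_3 X_4 + (b_1 e_1 + e_2) \\[4pt]
\text{With } D = 1 - a_1 b_1:\quad
f_0 &= \frac{b_0 + a_0 b_1}{D},\quad
f_1 = \frac{a_2 b_1}{D},\quad
f_2 = \frac{a_3 b_1}{D},\quad
f_3 = \frac{b_2}{D},\quad
f_4 = \frac{b_3}{D},\quad
u_2 = \frac{b_1 e_1 + e_2}{D}
\end{aligned}
```

Under these assumptions f1 and f2 are nonzero whenever b1 is nonzero (and a2, a3 are nonzero), which is how X1 and X2 enter the equation for Y2 even though they do not appear in (2). If b1 = 0, so that Y2 does not depend on Y1, both f1 and f2 vanish.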

All exogenous variables appear in each equation for an endogenous variable. This is the nature of simultaneous systems, so efficiency argues that all exogenous variables be included as instruments for each endogenous variable.

Here is the real problem. Take (1): the reduced-form equation for Y2, (2r), clearly shows that Y2 is correlated with X2 (by the coefficient f2). If we do not include X2 among the instruments for Y2, then we will have failed to account for the correlation of Y2 with X2 in its instrumented values. Since we did not account for this correlation, when we estimate (1) with the instrumented values for Y2, the coefficient a3 will be forced to account for this correlation. This approach will lead to biased estimates of both a1 and a3.

For a brief reference, see Baltagi (2011). See the whole discussion of 2SLS, particularly the paragraph after equation 11.40, on page 265. (I have no idea why this issue is not emphasized in more books.)

Failing to include X4 affects only efficiency and not bias.

However, there is one case where it is not necessary to include X1 and X2 as instruments for Y2. That is when the system is triangular such that Y2 does not depend on Y1, but you believe it is weakly endogenous because the disturbances are correlated between the equations. You are still consistent here to do what ivregress does and retain X1 and X2 as instruments. They are, however, no longer required. Then you could do what you suggested and just regress on the predicted instruments from the first stage.

If you do use this method of indirect least squares, you will have to perform the adjustment to the covariance matrix yourself. Consider the structural equation

        y1 = y2 + x1 + e

where you have an instrument z1 and you do not think that y2 is a function of y1.

The following example uses only z1 as an instrument for y2. Let’s begin by creating a dataset (containing made-up data) on y1, y2, x1, and z1:

. sysuse auto
(1978 Automobile Data)
. rename price y1
. rename mpg y2
. rename displacement z1
. rename turn x1

Now we perform the first-stage regression and get predictions for the instrumented variable, which we must do for each endogenous right-hand-side variable.

. regress y2 z1

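      Source |       SS       df       MS              Number of obs =      74
-------------+------------------------------           F(  1,    72) =   71.41
       Model |  1216.67534     1  1216.67534           Prob > F      =  0.0000
    Residual |  1226.78412    72  17.0386683           R-squared     =  0.4979
-------------+------------------------------           Adj R-squared =  0.4910
       Total |  2443.45946    73  33.4720474           Root MSE      =  4.1278

------------------------------------------------------------------------------
          y2 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          z1 |  -.0444536   .0052606    -8.45   0.000    -.0549405   -.0339668
       _cons |   30.06788   1.143462    26.30   0.000     27.78843    32.34733
------------------------------------------------------------------------------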

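. predict double y2hat
(option xb assumed; fitted values)

* perform IV regression
. regress y1 y2hat x1

      Source |       SS       df       MS              Number of obs =      74
-------------+------------------------------           F(  2,    71) =   12.41
       Model |   164538571     2  82269285.5           Prob > F      =  0.0000
    Residual |   470526825    71  6627138.38           R-squared     =  0.2591
-------------+------------------------------           Adj R-squared =  0.2382
       Total |   635065396    73  8699525.97           Root MSE      =  2574.3

------------------------------------------------------------------------------
          y1 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       y2hat |  -463.4688    117.187    -3.95   0.000    -697.1329   -229.8046
          x1 |  -126.4979   108.7468    -1.16   0.249    -343.3328    90.33697
       _cons |   21051.36   6451.837     3.26   0.002     8186.762    33915.96
------------------------------------------------------------------------------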

Now we correct the variance–covariance by applying the correct mean squared error:

. rename y2hat y2hold
. rename y2 y2hat
. predict double res, residual
. rename y2hat y2
. rename y2hold y2hat
. replace res = res^2
(74 real changes made)
. summarize res

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         res |        74     7553657    1.43e+07   117.4375   1.06e+08

. scalar realmse = r(mean)*r(N)/e(df_r)
. matrix bmatrix = e(b)
. matrix Vmatrix = e(V)
. matrix Vmatrix = e(V) * realmse / e(rmse)^2
. ereturn post bmatrix Vmatrix, noclear
. ereturn display

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       y2hat |  -463.4688   127.7267    -3.63   0.001    -718.1485    -208.789
          x1 |  -126.4979   118.5274    -1.07   0.289    -362.8348    109.8389
       _cons |   21051.36   7032.111     2.99   0.004      7029.73    35072.99
------------------------------------------------------------------------------
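
As a quick arithmetic check of the correction, the sketch below redoes the rescaling in Python using only numbers printed in the Stata output: the mean of res^2, the number of observations, the residual degrees of freedom, and the second-stage residual mean square (which equals e(rmse)^2). The variable names are my own, and the inputs are rounded as displayed, so the results can only match the corrected table approximately.

```python
import math

# Values copied from the Stata output (rounded as displayed there)
mean_sq_res = 7553657           # mean of res^2 reported by `summarize res`
n_obs = 74                      # r(N)
df_resid = 71                   # e(df_r) of the second-stage regression
mse_second_stage = 6627138.38   # second-stage residual MS, i.e. e(rmse)^2

# realmse = r(mean)*r(N)/e(df_r)
real_mse = mean_sq_res * n_obs / df_resid

# Scaling V by realmse/e(rmse)^2 scales each standard error by the square root
scale = math.sqrt(real_mse / mse_second_stage)

# Uncorrected standard errors from `regress y1 y2hat x1`
naive_se = {"y2hat": 117.187, "x1": 108.7468, "_cons": 6451.837}
corrected_se = {name: se * scale for name, se in naive_se.items()}

for name, se in corrected_se.items():
    print(f"{name:>6}: naive {naive_se[name]:.4f} -> corrected {se:.4f}")
```

Up to rounding, the corrected values agree with the standard errors shown by ereturn display. The coefficients are untouched; only the covariance matrix is rescaled.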

http://www.stata.com/support/faqs/statistics/instrumental-variables-regression/

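We could of course just use the simple ivreg command, but if there is only one instrumental variable, the FAQ above gives the answer.

Lian Yujun also points out in his lecture notes the mistake that is easy to make with the naive approach: chiefly, the residual series may be biased.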

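*------------------------
* Two-stage least squares (2SLS)
*------------------------

* For the model:
*
* y = x1*b1 + x2*b2 + e,  assume Corr(x2,e)!=0
*
*  If there are two instruments z1 and z2, we would obtain two IV estimators.
*  Question: how can these two IV estimators be combined?

*-- Solution: two-stage least squares (2SLS)
* Step 1:
*    reg x2 on x1 z1 z2 (the exogenous x1 serves as its own instrument) to get
*    the fitted values x_2 of x2; x_2 can be treated as the instrument for x2
* Step 2:
*    reg y  on x1 x_2, i.e., perform the IV estimation.
*
* Important:
*    This is the basic idea, but we must not literally carry it out this way,
*    because the standard errors reported by the second regression are wrong!

*-- Derivation:
*
* y = X*b + u                (1)
*
*-1 First stage: X = Z*b1 + v   (2)
*
*    X_hat = Z*b1_OLS          (3)
*          = Z*[inv(Z'Z)*Z'X]
*          = P_z*X  (where P_z = Z*inv(Z'Z)*Z')
*
*-2 Second stage: y = X*b + u
*    b_2SLS = inv(X_hat'*X)*X_hat'*y    (4)
*          = inv(X'*P_z*X)*X'*P_z*y
*
* Var(b_2SLS) = sigma^2*inv(X'*P_z*X)  (5)
*
* sigma^2 = e'*e/N (e is the residual vector) (6)
*
* e = y - X*b_2SLS                   (7)

* Note:
*    Although the name suggests that 2SLS should be run as a "two-step" procedure, doing so is incorrect;
*    the correct estimators are (4) and (5).
*  With the two-step procedure, the residual series is wrong:
*    e = y - X_hat*b_2SLS
*  whereas the correct expression is (7)!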
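
The point of the derivation can be illustrated numerically. The sketch below is in Python rather than Stata and uses a tiny made-up dataset (hypothetical values, not from the lecture notes) with one endogenous regressor, one instrument, and demeaned variables so that the intercept drops out. It computes the coefficient by formula (4) and by the naive second-step OLS, then compares sigma^2 from the correct residuals (7) with sigma^2 from the two-step residuals.

```python
# Made-up illustrative data (hypothetical, chosen only for this sketch)
z = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]     # instrument
x = [1.2, 1.9, 3.4, 3.8, 5.3, 5.8]     # endogenous regressor
y = [2.0, 4.1, 5.8, 8.3, 9.7, 12.4]    # outcome


def demean(v):
    """Subtract the mean, which absorbs the intercept."""
    m = sum(v) / len(v)
    return [a - m for a in v]


def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(p * q for p, q in zip(a, b))


zd, xd, yd = demean(z), demean(x), demean(y)
n = len(yd)

# First stage, eq. (3): X_hat = Z * inv(Z'Z) * Z'X
pi_hat = dot(zd, xd) / dot(zd, zd)
x_hat = [pi_hat * a for a in zd]

# Eq. (4): b_2SLS = inv(X_hat'X) * X_hat'y
b_2sls = dot(x_hat, yd) / dot(x_hat, xd)

# Naive second step: OLS of y on X_hat (same coefficient, algebraically)
b_two_step = dot(x_hat, yd) / dot(x_hat, x_hat)

# Eq. (7): correct residuals are built from the actual X
e_correct = [yi - b_2sls * xi for yi, xi in zip(yd, xd)]
# Two-step residuals are built from X_hat instead
e_two_step = [yi - b_two_step * xh for yi, xh in zip(yd, x_hat)]

# Eq. (6): sigma^2 = e'e / N
sigma2_correct = dot(e_correct, e_correct) / n
sigma2_two_step = dot(e_two_step, e_two_step) / n

print("b from (4):        ", b_2sls)
print("b from two steps:  ", b_two_step)
print("sigma^2 from (7):  ", sigma2_correct)
print("sigma^2 two-step:  ", sigma2_two_step)
```

The two coefficients coincide, but the two values of sigma^2 do not. That is why only the covariance matrix, and not the point estimate, needs correcting after a manual second stage.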

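* Manual 2SLS with corrected standard errors (z1 is the only instrument for x2)
use yourdata, clear

* First stage: regress the endogenous x2 on the instrument and keep the fitted values
regress x2 z1
predict double x2hat

* Second stage: coefficients are fine, but the reported standard errors are not
regress y x2hat x1

* Residuals must be computed with the actual x2, so swap the names temporarily:
* predict then applies _b[x2hat] to the original x2
rename x2hat x2hold
rename x2 x2hat
predict double res, residual
rename x2hat x2
rename x2hold x2hat

* Correct MSE = sum of squared residuals / residual degrees of freedom
replace res = res^2
summarize res
scalar realmse = r(mean)*r(N)/e(df_r)

* Rescale the second-stage covariance matrix and repost the results
matrix bmatrix = e(b)
matrix Vmatrix = e(V)*realmse/e(rmse)^2
scalar list realmse
matrix list Vmatrix
ereturn post bmatrix Vmatrix, noclear
ereturn display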

Computing 2SLS standard errors by hand

http://bbs.pinggu.org/thread-895167-2-1.html

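If I do not use ivreg/xtivreg and instead run the second-stage regression by hand, how should the standard errors be computed? The ones the software reports are wrong, so how should they be adjusted? There are situations where ivreg/xtivreg cannot be used directly, for example: 1. the endogenous variable is binary (0/1), so the first stage is better estimated with probit/logit than with an LPM; 2. the data are a panel but the instrument is time-invariant, so fixed effects cannot be used in the first stage and can only be used in the second stage. Thanks for your help!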

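* Run all of the following; both approaches give the same results:

* Benchmark: built-in 2SLS with the small-sample (n-k) adjustment
ivregress 2sls y x1-x5 (a1-a3 = z1-z4), small
mat w = (e(b)', vecdiag(e(V))')
n mat l w

* First stage: each endogenous variable on all exogenous variables and instruments
foreach i of varlist a1-a3 {
    reg `i' x1-x5 z1-z4
    predict `i'p
}

* Second stage: OLS on the fitted values
reg y a1p-a3p x1-x5

* Mean squared residual as used by the second stage (based on fitted values)
predict u, r
g u2 = u*u
su u2
sca mse_u = r(mean)

* Mean squared residual based on the actual endogenous variables
predictnl e=y-_b[_cons]-x1*_b[x1]-x2*_b[x2]-x3*_b[x3]-x4*_b[x4]-x5*_b[x5]-a1*_b[a1p]-a2*_b[a2p]-a3*_b[a3p]
g e2 = e*e
su e2
sca mse_e = r(mean)

* Coefficients and rescaled variances; compare with w
mat v = (e(b)', vecdiag(mse_e*e(V)/mse_u)')
n mat l v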
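
Why the final rescaling reproduces the ivregress variances can be sketched as follows. The notation is introduced here for illustration and is not from the forum post: \hat{X} is the second-stage regressor matrix (fitted values plus exogenous variables), b the second-stage coefficient vector, X the same matrix with the actual endogenous variables, n the sample size, and k the number of coefficients.

```latex
\begin{aligned}
\widehat{V}_{\text{second stage}} &= \frac{\hat{u}'\hat{u}}{n-k}\,(\hat{X}'\hat{X})^{-1},
  &\hat{u} &= y - \hat{X} b \\
\widehat{V}_{\text{2SLS}} &= \frac{\hat{e}'\hat{e}}{n-k}\,(\hat{X}'\hat{X})^{-1},
  &\hat{e} &= y - X b \\[4pt]
\widehat{V}_{\text{2SLS}} &= \widehat{V}_{\text{second stage}}
  \times \frac{\hat{e}'\hat{e}/n}{\hat{u}'\hat{u}/n}
\end{aligned}
```

Both variance estimates share the same matrix (\hat{X}'\hat{X})^{-1} and the same divisor n-k, so the ratio of the two mean squared residuals is the only factor needed. This matches the ivregress results only when the small option is used, since that option applies the n-k divisor.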

Reference example: http://pan.baidu.com/s/1jHvIy6U

http://www.systemfit.org/

library(systemfit)

## Replicating the estimations in Kmenta (1986), p. 712, Tab 13-2
data( "Kmenta" )
eqDemand <- consump ~ price + income
eqSupply <- consump ~ price + farmPrice + trend
inst <- ~ income + farmPrice + trend
system <- list( demand = eqDemand, supply = eqSupply )

## OLS estimation
fitOls <- systemfit( system, data = Kmenta )
summary( fitOls )

## 2SLS estimation
fit2sls <- systemfit( system, "2SLS", inst = inst, data = Kmenta )
summary( fit2sls )

## 3SLS estimation
fit3sls <- systemfit( system, "3SLS", inst = inst, data = Kmenta )
summary( fit3sls )

## Iterated 3SLS (I3SLS) estimation
fitI3sls <- systemfit( system, "3SLS", inst = inst, data = Kmenta,
                       maxit = 250 )
summary( fitI3sls )

https://sites.google.com/site/econometricsacademy/econometrics-models/instrumental-variables

# Instrumental Variables in R
# install.packages("AER")
library(AER)
# install.packages("systemfit")
library(systemfit)

mydata <- read.csv("f:/iv_health.csv")
attach(mydata)

# Defining variables (Y1 dependent variable, Y2 endogenous variable)
# (X1 exogenous variables, X2 instrument, X2alt instruments for the over-identified case)
Y1 <- cbind(logmedexpense)
Y2 <- cbind(healthinsu)
X1 <- cbind(illnesses, age, logincome)
X2 <- cbind(ssiratio)
X2alt <- cbind(ssiratio, firmlocation)

# Descriptive statistics
summary(Y1)
summary(Y2)
summary(X1)
summary(X2)

# OLS regression
olsreg <- lm(Y1 ~ Y2 + X1)
summary(olsreg)

# 2SLS estimation
ivreg <- ivreg(Y1 ~ Y2 + X1 | X1 + X2)
summary(ivreg)

# 2SLS estimation (details)
olsreg1 <- lm (Y2 ~ X1 + X2)
summary(olsreg1)
Y2hat <- fitted(olsreg1)
olsreg2 <- lm(Y1 ~ Y2hat + X1)
summary(olsreg2)

# 2SLS estimation, over-identified case
ivreg_o <- ivreg(Y1 ~ Y2 + X1 | X1 + X2alt)
summary(ivreg_o)

# Hausman test for endogeneity of regressors
cf_diff <- coef(ivreg) - coef(olsreg)
vc_diff <- vcov(ivreg) - vcov(olsreg)
x2_diff <- as.vector(t(cf_diff) %*% solve(vc_diff) %*% cf_diff)
pchisq(x2_diff, df = 2, lower.tail = FALSE)

# Systems of equations
# Defining equations for systems of equations (2SLS and 3SLS)
# (X12 exogenous variable for eq2, X22 instrument for eq2)
X12 <- cbind(illnesses)
X22 <- cbind(firmlocation)
eq1 <- Y1 ~ Y2 + X1 + X2
eq2 <- Y2 ~ Y1 + X12 + X22
inst <- ~ X1 + X2 + X22
system <- list(eq1 = eq1, eq2 = eq2)

# 2SLS estimation
reg2sls <- systemfit(system, "2SLS", inst = inst, data = mydata)
summary(reg2sls)

# 3SLS estimation
reg3sls <- systemfit(system, "3SLS", inst = inst, data = mydata)
summary(reg3sls)

https://sites.google.com/site/econometricsacademy/
