Stepwise Regression in R

Tags: big data analysis | data scientist | data analyst | data mining
Stepwise regression selects, from the candidate predictors X, the variables that have a significant effect on Y, so as to arrive at an optimal model.
http://www.cda.cn/uploadfile/image/20180318/20180318121348_87669.png
We use R's step() function.

First, load the dataset. (This is the classic Hald cement data; the values below are filled in to match the model output later in the article.)

cement <- data.frame(
  X1 = c( 7,  1, 11, 11,  7, 11,  3,  1,  2, 21,  1, 11, 10),
  X2 = c(26, 29, 56, 31, 52, 55, 71, 31, 54, 47, 40, 66, 68),
  X3 = c( 6, 15,  8,  8,  6,  9, 17, 22, 18,  4, 23,  9,  8),
  X4 = c(60, 52, 20, 47, 33, 22,  6, 44, 22, 26, 34, 12, 12),
  Y  = c(78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7, 72.5,
         93.1, 115.9, 83.8, 113.3, 109.4)
)
> lm.sol<-lm(Y ~ X1+X2+X3+X4, data=cement)
> summary(lm.sol)

Call:
lm(formula = Y ~ X1 + X2 + X3 + X4, data = cement)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.1750 -1.6709  0.2508  1.3783  3.9254 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  62.4054    70.0710   0.891   0.3991  
X1            1.5511     0.7448   2.083   0.0708 .
X2            0.5102     0.7238   0.705   0.5009  
X3            0.1019     0.7547   0.135   0.8959  
X4           -0.1441     0.7091  -0.203   0.8441  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.446 on 8 degrees of freedom
Multiple R-squared:  0.9824,    Adjusted R-squared:  0.9736 
F-statistic: 111.5 on 4 and 8 DF,  p-value: 4.756e-07
The overall fit is good, but none of the individual coefficients is significant, so the result is unconvincing.
We therefore use stepwise regression to select variables. The direction argument of step() can be "forward", "backward", or "both"; the default is "both".
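As a sketch, the three modes can be invoked explicitly (reusing the cement data from above; for forward selection the search range must be supplied via scope):

```r
# Backward elimination: start from the full model and drop terms
lm.back <- step(lm(Y ~ X1 + X2 + X3 + X4, data = cement),
                direction = "backward")

# Forward selection: start from the intercept-only model and add
# terms drawn from the formula given in `scope`
lm.forw <- step(lm(Y ~ 1, data = cement),
                scope = ~ X1 + X2 + X3 + X4,
                direction = "forward")

# Both directions (the default): terms may be added or dropped at
# each step, whichever lowers AIC the most
lm.both <- step(lm(Y ~ X1 + X2 + X3 + X4, data = cement),
                direction = "both")
```

All three variants stop when no single addition or deletion lowers the AIC any further.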
> lm.step<-step(lm.sol)
Start:  AIC=26.94
Y ~ X1 + X2 + X3 + X4

       Df Sum of Sq    RSS    AIC
- X3    1    0.1091 47.973 24.974
- X4    1    0.2470 48.111 25.011
- X2    1    2.9725 50.836 25.728
<none>              47.864 26.944
- X1    1   25.9509 73.815 30.576

Step:  AIC=24.97
Y ~ X1 + X2 + X4

       Df Sum of Sq    RSS    AIC
<none>               47.97 24.974
- X4    1      9.93  57.90 25.420
- X2    1     26.79  74.76 28.742
- X1    1    820.91 868.88 60.629
> lm.step$anova
  Step Df  Deviance Resid. Df Resid. Dev      AIC
1      NA        NA         8   47.86364 26.94429
2 - X3  1 0.1090909         9   47.97273 24.97421
Clearly, removing X3 lowers the AIC, so step() drops X3 for us automatically.
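As a sketch of the criterion behind the trace: for a linear model, step() uses extractAIC(), which (up to an additive constant) computes n·log(RSS/n) + 2p, where p is the number of estimated coefficients. You can verify the reported AIC by hand:

```r
# Recompute the AIC that step() reports for the selected model
n   <- nrow(cement)
rss <- sum(residuals(lm.step)^2)   # residual sum of squares
p   <- length(coef(lm.step))       # number of estimated parameters
n * log(rss / n) + 2 * p           # same value as extractAIC(lm.step)[2]
```

This is why dropping a nearly useless term like X3 lowers the AIC: the tiny increase in RSS is outweighed by the 2-point reduction in the penalty term.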
> summary(lm.step)

Call:
lm(formula = Y ~ X1 + X2 + X4, data = cement)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.0919 -1.8016  0.2562  1.2818  3.8982 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  71.6483    14.1424   5.066 0.000675 ***
X1            1.4519     0.1170  12.410 5.78e-07 ***
X2            0.4161     0.1856   2.242 0.051687 .  
X4           -0.2365     0.1733  -1.365 0.205395    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.309 on 9 degrees of freedom
Multiple R-squared:  0.9823,    Adjusted R-squared:  0.9764 
F-statistic: 166.8 on 3 and 9 DF,  p-value: 3.323e-08
Clearly X2 and X4 are still not significant at the 5% level.
We can examine the effect of adding or dropping single terms with the add1() and drop1() functions.
> drop1(lm.step)
Single term deletions

Model:
Y ~ X1 + X2 + X4
       Df Sum of Sq    RSS    AIC
<none>              47.97 24.974
X1      1    820.91 868.88 60.629
X2      1     26.79  74.76 28.742
X4      1      9.93  57.90 25.420
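As a sketch, add1() works in the opposite direction, scoring each candidate term that could be added to a smaller model (here the intercept-only model, with the search range given by scope):

```r
# Score each single term that could be added to the null model
lm.null <- lm(Y ~ 1, data = cement)
add1(lm.null, scope = ~ X1 + X2 + X3 + X4)
```

Like drop1(), it prints one row per candidate term with the resulting RSS and AIC, so you can see which single addition would help the most.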
Besides AIC, the residual sum of squares is another important criterion. Dropping X4 increases the RSS by only 9.93 (to 57.90), by far the smallest increase among the three terms, so we remove X4 as well.
Update the model:
> lm.opt<-lm(Y ~ X1+X2, data=cement)
> summary(lm.opt)

Call:
lm(formula = Y ~ X1 + X2, data = cement)

Residuals:
   Min     1Q Median     3Q    Max 
-2.893 -1.574 -1.302  1.363  4.048 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 52.57735    2.28617   23.00 5.46e-10 ***
X1           1.46831    0.12130   12.11 2.69e-07 ***
X2           0.66225    0.04585   14.44 5.03e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.406 on 10 degrees of freedom
Multiple R-squared:  0.9787,    Adjusted R-squared:  0.9744 
F-statistic: 229.5 on 2 and 10 DF,  p-value: 4.407e-09
Now every coefficient is highly significant and the fit is still excellent, so Y ~ X1 + X2 is a satisfactory final model.
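As a final sketch, the selected model can be used for prediction with predict(); the mixture values below are purely illustrative:

```r
# Predict Y for a hypothetical new observation, with a 95% prediction interval
newdata <- data.frame(X1 = 10, X2 = 50)
predict(lm.opt, newdata = newdata, interval = "prediction")
```

The output gives the fitted value together with lower and upper prediction bounds for a single new response.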