统计建模与R软件第六章习题答案(回归分析)
 (2012-03-02 12:19:49)
	
			
					(2012-03-02 12:19:49)		| 标签: r | 分类: R | 
			Ex6.1
(1)
> x <- c(5.1, 3.5, 7.1, 6.2, 8.8, 7.8, 4.5, 5.6, 8.0, 6.4)
> y <- c(1907, 1287, 2700, 2373, 3260, 3000, 1947, 2273, 3113,2493)
> plot(x,y)
http://s14/middle/681aaa554ba006061415d&690
由此判断,Y和X有线性关系。
(2)
> lm.sol<-lm(y~1+x)
> summary(lm.sol)
Call:
lm(formula = y ~ 1 + x)
Residuals:
    
Min      
1Q  
Median      
3Q     
Max                       
-128.591
-70.978  
-3.727   49.263 
167.228      
Coefficients:
           
Estimate Std. Error t value
Pr(>|t|)                
(Intercept) 
140.95    
125.11  
1.127   
0.293                
x           
364.18     
19.26  18.908 6.33e-08 ***                  
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 '
' 1 
Residual standard error: 96.42 on 8 degrees of freedom
Multiple R-squared: 0.9781,   
Adjusted R-squared: 0.9754    
F-statistic: 357.5 on 1 and 8 DF, p-value: 6.33e-08 
回归方程为 Y=140.95+364.18X
(3)
β1项很显著,但常数项β0不显著。
回归方程很显著。
(4)
> new <- data.frame(x=7)
> lm.pred<-predict(lm.sol,new,interval="prediction")
> lm.pred
      
fit     
lwr     
upr                
1 2690.227 2454.971 2925.484
故Y(7)= 2690.227, [2454.971,2925.484]
Ex6.2
(1)
>pho<-data.frame(x1 <- c(0.4,0.4,3.1,0.6,4.7,1.7,9.4,10.1,11.6,12.6,10.9,23.1,23.1,21.6,23.1,1.9,26.8,29.9), x2 <- c(52,34,19,34,24,65,44,31,29,58,37,46,50,44,56,36,58,51), x3 <- c(158,163,37,157,59,123,46,117,173,112,111,114,134,73,168,143,202,124), y <- c(64,60,71,61,54,77,81,93,93,51,76,96,77,93,95,54,168,99))
> lm.sol<-lm(y~x1+x2+x3,data=pho)
> summary(lm.sol)
Call:
lm(formula = y ~ x1 + x2 + x3, data = pho)
Residuals:
   
Min     
1Q 
Median     
3Q    
Max                  
-27.575 -11.160 -2.799 
11.574  48.808   
Coefficients:
           
Estimate Std. Error t value
Pr(>|t|)              
(Intercept)
44.9290   
18.3408   2.450 
0.02806 *       
x1          
1.8033    
0.5290   3.409 
0.00424 **                  
x2         
-0.1337    
0.4440  -0.301 
0.76771                   
x3          
0.1668    
0.1141   1.462 
0.16573                     
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 '
' 1 
Residual standard error: 19.93 on 14 degrees of freedom
Multiple R-squared: 0.551,    
Adjusted R-squared: 0.4547     
F-statistic: 5.726 on 3 and 14 DF, p-value: 0.009004 
回归方程为 y=44.9290+1.8033x1-0.1337x2+0.1668x3
(2)
回归方程显著,但有些回归系数不显著。
(3)
> lm.step<-step(lm.sol)
Start:
AIC=111.2 
y ~ x1 + x2 + x3
      
Df Sum of
Sq    
RSS    
AIC              
- x2  
1     
36.0  5599.4  
109.3           
<none>             
5563.4   111.2                
- x3  
1    
849.8  6413.1  
111.8          
- x1  
1    4617.8
10181.2   120.1        
Step:
AIC=109.32 
y ~ x1 + x3
      
Df Sum of
Sq    
RSS    
AIC              
<none>             
5599.4   109.3                
- x3  
1    
833.2  6432.6  
109.8          
- x1  
1    5169.5
10768.9   119.1        
> summary(lm.step)
Call:
lm(formula = y ~ x1 + x3, data = pho)
Residuals:
   
Min     
1Q 
Median     
3Q    
Max                  
-29.713 -11.324 -2.953 
11.286  48.679   
Coefficients:
           
Estimate Std. Error t value
Pr(>|t|)              
(Intercept)
41.4794   
13.8834   2.988 
0.00920 **       
x1          
1.7374    
0.4669   3.721 
0.00205 **                  
x3          
0.1548    
0.1036   1.494 
0.15592                     
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 '
' 1 
Residual standard error: 19.32 on 15 degrees of freedom
Multiple R-squared: 0.5481,   
Adjusted R-squared: 0.4878    
F-statistic: 9.095 on 2 and 15 DF, p-value: 0.002589 
x3仍不够显著。
再用drop1函数做逐步回归。
> drop1(lm.step)
Single term deletions
Model:
y ~ x1 + x3
      
Df Sum of
Sq    
RSS    
AIC              
<none>             
5599.4   109.3                
x1    
1    5169.5
10768.9   119.1          
x3    
1    
833.2  6432.6  
109.8            
可以考虑再去掉x3.
> lm.opt<-lm(y~x1,data=pho);summary(lm.opt)
Call:
lm(formula = y ~ x1, data = pho)
Residuals:
   
Min     
1Q 
Median     
3Q    
Max                  
-31.486
-8.282  -1.674  
5.623  59.337     
Coefficients:
           
Estimate Std. Error t value
Pr(>|t|)                
(Intercept)
59.2590    
7.4200   7.986 5.67e-07
***       
x1          
1.8434    
0.4789   3.849 
0.00142 **                  
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 '
' 1 
Residual standard error: 20.05 on 16 degrees of freedom
Multiple R-squared: 0.4808,   
Adjusted R-squared: 0.4484    
F-statistic: 14.82 on 1 and 16 DF, p-value: 0.001417 
皆显著。
Ex6.3
> x<-c(1,1,1,1,2,2,2,3,3,3,4,4,4,5,6,6,6,7,7,7,8,8,8,9,11,12,12,12)
> y<-c(0.6,1.6,0.5,1.2,2.0,1.3,2.5,2.2,2.4,1.2,3.5,4.1,5.1,5.7,3.4,9.7,8.6,4.0,5.5,10.5,17.5,13.4,4.5,30.4,12.4,13.4,26.2,7.4)
> plot(x,y)
> lm.sol<-lm(y~1+x)
> summary(lm.sol)
Call:
lm(formula = y ~ 1 + x)
Residuals:
   
Min     
1Q 
Median     
3Q    
Max                  
-9.8413 -2.3369 -0.0214 1.0592 17.8320 
Coefficients:
           
Estimate Std. Error t value
Pr(>|t|)                
(Intercept)
-1.4519    
1.8353 
-0.791   
0.436              
x           
1.5578    
0.2807   5.549 7.93e-06
***                  
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 '
' 1 
Residual standard error: 5.168 on 26 degrees of freedom
Multiple R-squared: 0.5422,   
Adjusted R-squared: 0.5246    
F-statistic:
30.8 on 1 and 26 DF,  p-value:
7.931e-06  
线性回归方程为 y=-1.4519+1.5578x,通过F 检验。 常数项参数未通过t 检验。
> abline(lm.sol)
http://s3/middle/681aaa554ba29a2d63852&690
> y.yes<-resid(lm.sol)
> y.fit<-predict(lm.sol)
> y.rst<-rstandard(lm.sol)
> plot(y.yes~y.fit)
http://s8/middle/681aaa554ba29c3200bf7&690
> plot(y.rst~y.fit)
http://s10/middle/681aaa554ba29c20b25b9&690
残差并非是等方差的。
修正模型,对相应变量Y做开方。
> lm.new<-update(lm.sol,sqrt(.)~.)
> summary(lm.new)
Call:
lm(formula = sqrt(y) ~ x)
Residuals:
    
Min      
1Q  
Median      
3Q     
Max                       
-1.54255 -0.45280 -0.01177 0.34925 
2.12486  
Coefficients:
           
Estimate Std. Error t value
Pr(>|t|)                
(Intercept)
0.76650   
0.25592   2.995 
0.00596 **       
x          
0.29136   
0.03914   7.444 6.64e-08
***                
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 '
' 1 
Residual standard error: 0.7206 on 26 degrees of freedom
Multiple R-squared: 0.6806,   
Adjusted R-squared: 0.6684    
F-statistic: 55.41 on 1 and 26 DF, p-value: 6.645e-08 
此时所有参数和方程均通过检验。
对新模型做标准化残差图,情况有所改善,不过还是存在一个离群值。第24和第28个值存在问题。
http://s11/middle/681aaa554ba29e4bc2f9a&690
Ex6.4
> toothpaste<-data.frame( X1=c(-0.05, 0.25,0.60,0,0.20, 0.15,-0.15, 0.15,0.10,0.40,0.45,0.35,0.30, 0.50,0.50,0.40,-0.05,-0.05,-0.10,0.20,0.10,0.50,0.60,-0.05,0,0.05,0.55),X2=c(5.50,6.75,7.25,5.50,6.50,6.75,5.25,6.00,6.25,7.00,6.90,6.80,6.80,7.10,7.00,6.80,6.50,6.25,6.00,6.50,7.00,6.80,6.80,6.50,5.75,5.80,6.80),Y=c(7.38,8.51,9.52,7.50,8.28,8.75,7.10,8.00,8.15,9.10,8.86,8.90,8.87,9.26,9.00,8.75,7.95, 7.65,7.27,8.00,8.50,8.75,9.21,8.27,7.67,7.93,9.26))
> lm.sol<-lm(Y~X1+X2,data=toothpaste); summary(lm.sol)
Call:
lm(formula = Y ~ X1 + X2, data = toothpaste)
Residuals:
    
Min      
1Q  
Median      
3Q     
Max                       
-0.37130 -0.10114 0.03066 
0.10016  0.30162   
Coefficients:
           
Estimate Std. Error t value
Pr(>|t|)                
(Intercept) 
4.0759    
0.6267   6.504 1.00e-06
***        
X1          
1.5276    
0.2354   6.489 1.04e-06
***                 
X2          
0.6138    
0.1027   5.974 3.63e-06
***                 
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 '
' 1 
Residual standard error: 0.1767 on 24 degrees of freedom
Multiple R-squared: 0.9378,   
Adjusted R-squared: 0.9327    
F-statistic: 
181 on 2 and 24 DF,  p-value:
3.33e-15   
回归诊断:
> influence.measures(lm.sol)
Influence measures of
        
lm(formula = Y ~ X1 + X2, data = toothpaste) :        
    
dfb.1_  
dfb.X1  
dfb.X2   dffit
cov.r  
cook.d    hat
inf               
1 
0.00908  0.00260 -0.00847  0.0121
1.366 5.11e-05
0.1681         
2 
0.06277  0.04467 -0.06785 -0.1244 1.159 5.32e-03
0.0537        
3
-0.02809  0.07724 
0.02540  0.1858 1.283 1.19e-02
0.1386         
4 
0.11688  0.05055 -0.11078  0.1404
1.377 6.83e-03 0.1843  
*      
5 
0.01167  0.01887 -0.01766 -0.1037 1.141 3.69e-03
0.0384        
6 -0.43010
-0.42881  0.45774  0.6061 0.814
1.11e-01
0.0936        
7 
0.07840  0.01534 -0.07284  0.1082
1.481 4.07e-03 0.2364  
*      
8 
0.01577  0.00913 -0.01485  0.0208
1.237 1.50e-04
0.0823         
9 
0.01127 -0.02714 -0.00364  0.1071 1.156 3.95e-03
0.0466        
10 -0.07830
0.00171  0.08052  0.1890 1.155
1.22e-02
0.0726        
11 0.00301
-0.09652 -0.00365 -0.2281 1.127 1.76e-02
0.0735      
12 -0.03114
0.01848  0.03459  0.1542 1.132
8.12e-03
0.0514        
13 -0.09236 -0.03801 0.09940  0.2201 1.071
1.62e-02
0.0522       
14 -0.02650
0.03434  0.02606  0.1179 1.235
4.81e-03
0.0956        
15 0.00968
-0.11445 -0.00857 -0.2545 1.150 2.19e-02
0.0910      
16 -0.00285 -0.06185 0.00098 -0.1608 1.146 8.83e-03
0.0594      
17
0.07201  0.09744 -0.07796 -0.1099 1.364 4.19e-03
0.1731       
18
0.15132  0.30204 -0.17755 -0.3907 1.087 5.04e-02
0.1085       
19
0.07489  0.47472 -0.12980 -0.7579 0.731 1.66e-01
0.1092       
20
0.05249  0.08484 -0.07940 -0.4660 0.625 6.11e-02
0.0384   *    
21
0.07557  0.07284 -0.07861 -0.0880 1.471 2.69e-03
0.2304   *    
22 -0.17959 -0.39016 0.18241 -0.5494 0.912 9.41e-02
0.1022      
23
0.06026  0.10607 -0.06207  0.1251
1.374 5.42e-03
0.1804        
24 -0.54830 -0.74197 0.59358  0.8371 0.914
2.13e-01
0.1731       
25
0.08541  0.01624 -0.07775  0.1314
1.249 5.97e-03
0.1069        
26
0.32556  0.11734 -0.30200  0.4480
1.018 6.49e-02
0.1033        
27
0.17243  0.32754 -0.17676  0.4127
1.148 5.66e-02
0.1369        
> source("Reg_Diag.R");Reg_Diag(lm.sol) #薛毅老师自己写的程序
     
residual s1   
standard
s2    
student s3 hat_matrix
s4     
DFFITS s5                 
1 
0.00443843    
0.02753865    
0.02695925   
0.16811819    
0.01211949                    
2
-0.09114255   
-0.53021138   
-0.52211469   
0.05369239   
-0.12436727                
3 
0.07726887    
0.47112863    
0.46335666   
0.13857353    
0.18584310                    
4 
0.04805665    
0.30111062    
0.29532912   
0.18427663    
0.14036860                    
5
-0.09130271   
-0.52689847   
-0.51881406   
0.03838430   
-0.10365442                
6 
0.30162101    
1.79287913    
1.88596579   
0.09362223    
0.60613406                    
7 
0.03066005    
0.19855842    
0.19453763   
0.23641540  * 
0.10824626                  
8 
0.01199519    
0.07085860    
0.06937393   
0.08226537    
0.02077047                    
9 
0.08491891    
0.49217591    
0.48426323   
0.04664158    
0.10711246                    
10
0.11625405    
0.68315814    
0.67537315   
0.07261134    
0.18897969                   
11 -0.13874451  
-0.81570765   
-0.80983786   
0.07348894   
-0.22807820               
12
0.11540228    
0.67051940    
0.66263761   
0.05137589    
0.15420864                   
13
0.16178406    
0.94041623    
0.93806144   
0.05219432    
0.22013204                   
14
0.06210727    
0.36957277    
0.36282531   
0.09557411    
0.11794546                   
15 -0.13650951  
-0.81026658   
-0.80428349   
0.09101221   
-0.25449541               
16 -0.11097950  
-0.64757782   
-0.63955524   
0.05943308   
-0.16076716               
17 -0.03939381  
-0.24515626   
-0.24029557   
0.17309048   
-0.10993940               
18 -0.18593575  
-1.11438446   
-1.12029013   
0.10845395   
-0.39073410               
19 -0.33609591  
-2.01522068  * -2.16439284  *
0.10922236   
-0.75789180  *         
20 -0.37130271 * -2.14274943  *
-2.33258738  *
0.03838430   
-0.46603012         
21 -0.02545527  
-0.16420856   
-0.16084153   
0.23042354  *
-0.08801075             
22 -0.26374306  
-1.57517595   
-1.62848498   
0.10217431   
-0.54936198               
23
0.04349338    
0.27187605    
0.26656251   
0.18041800    
0.12506702                   
24
0.28060619    
1.74627363    
1.82969510   
0.17309048    
0.83711731  *                 
25
0.06459859    
0.38683016    
0.37987153   
0.10691352    
0.13143357                   
26
0.21752520    
1.29995371    
1.31989945   
0.10329116    
0.44796770                   
27
0.16987516    
1.03474390    
1.03633781   
0.13685835    
0.41266341                   
  
cooks_distance s6  COVRATIO s7   
1  
5.108777e-05   
1.3656752         
2  
5.316885e-03   
1.1589547         
3  
1.190200e-02   
1.2827036         
4  
6.827446e-03   
1.3771332         
5  
3.693897e-03   
1.1410104         
6  
1.106753e-01   
0.8143179         
7  
4.068871e-03   
1.4806452  *       
8  
1.500251e-04   
1.2372586         
9  
3.950358e-03   
1.1560508         
10 
1.218047e-02   
1.1550557        
11 
1.759216e-02   
1.1271148        
12 
8.116460e-03   
1.1316638        
13 
1.623390e-02   
1.0710597        
14 
4.811117e-03   
1.2349272        
15 
2.191171e-02   
1.1501502        
16 
8.832858e-03   
1.1457602        
17 
4.193532e-03   
1.3637206        
18 
5.035591e-02   
1.0866343        
19 
1.659840e-01   
0.7313914        
20 
6.109050e-02   
0.6248838        
21 
2.691197e-03   
1.4714103        
22 
9.412101e-02   
0.9121786        
23 
5.423856e-03   
1.3735324        
24 
2.127740e-01  *
0.9139942      
25 
5.971157e-03   
1.2485557        
26 
6.488529e-02   
1.0178195        
27 
5.658922e-02   
1.1479080        
为什么两种方法检测结果不一样呢...不继续了
Ex6.5
> cement<-data.frame(X1=c( 7, 1, 11,
11,  7, 11,  3, 
1,  2, 21,  1, 11, 10),X2=c(26,
29, 56, 31, 52, 55, 71, 31, 54, 47, 40, 66, 68),X3=c( 6,
15,  8,  8, 
6,  9, 17, 22, 18,  4,
23,  9,  8),X4=c(60, 52, 20, 47,
33, 22,  6, 44, 22, 26, 34, 12, 12),Y =c(78.5,
74.3, 104.3,  87.6,  95.9, 109.2,
102.7, 72.5, 93.1,115.9,  83.8, 113.3,
109.4))                 
> xx<-cor(cement[1:4])
> kappa(xx,exact=T)
[1] 1376.881
大于1000,认为有严重的多重共线性。
> eigen(xx)
$values
[1] 2.235704035 1.576066070 0.186606149 0.001623746
$vectors
          
[,1]      
[,2]      
[,3]     
[,4]                           
[1,] -0.4759552 0.5089794  0.6755002
0.2410522  
[2,] -0.5638702 -0.4139315 -0.3144204 0.6417561
[3,]
0.3940665 -0.6049691  0.6376911
0.2684661  
[4,]
0.5479312  0.4512351 -0.1954210
0.6767340  
即2.235704035X1+1.576066070X2+0.186606149X3+0.001623746X4=0
可以忽略X4项,可以看出X1,X2,X3存在共线性。
删去X3和X4后:
> xx<-cor(cement[1:2])
> kappa(xx,exact=T)
[1] 1.59262
共线性消失了。
如果去掉X3呢?
> cement1<-cbind(cement$X1,cement$X2,cement$X4)
> xx<-cor(cement1)
> kappa(xx,exact=T)
[1] 77.25113
如果去掉X4呢?
> xx<-cor(cement[1:3])
> kappa(xx,exact=T)
[1] 11.12112
看起来去掉X3和X4是合理的。
Ex6.6
>infection<-data.frame(x1<-c(0,1,0,0,0,1,1,1),x2<-c(0,0,1,0,1,0,1,1),x3<-c(0,0,0,1,1,1,0,1),success<-c(0,0,23,8,28,0,11,1),fail<-c(9,0,3,32,30,2,87,17))
> infection$Ymat<-cbind(infection$success,infection$fail)
> glm.sol<-glm(Ymat~x1+x2+x3,family=binomial,data=infection)
> summary(glm.sol)
Call:
glm(formula = Ymat ~ x1 + x2 + x3, family = binomial, data = infection)
Deviance Residuals:
      
1        
2        
3        
4        
5        
6        
7        
8                                                               
-2.56229 
0.00000  
1.49623  
1.21563  -0.78520 
-0.15231 
-0.07162  
0.26470            
Coefficients:
           
Estimate Std. Error z value
Pr(>|z|)                
(Intercept)
-0.8207    
0.4947  -1.659  
0.0971 .         
x1         
-3.2544    
0.4813  -6.761 1.37e-11 ***               
x2          
2.0299    
0.4553   4.459 8.25e-06
***                 
x3         
-1.0720    
0.4254  -2.520  
0.0117 *                  
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 '
' 1 
(Dispersion parameter for binomial family taken to be 1)
   
Null deviance: 83.491  on 6 
degrees of freedom     
Residual deviance: 10.997 on 3  degrees of
freedom  
AIC: 36.178
Number of Fisher Scoring iterations: 5
回归模型为
P=exp(-0.8207-3.2544x1+2.0299x2-1.0720x3)/(1+exp(-0.8207-3.2544x1+2.0299x2-1.0720x3))
Ex6.7
(1)
> x<-c(rep(0:6,rep(2,7)))
> y<-c(508.1,498.4,568.2,577.3,651.7,657.0,713.4,697.5,755.3,758.9,787.6,792.1,841.4,831.8)
> lm.sol<-lm(y~1+x)
> summary(lm.sol)
Call:
lm(formula = y ~ 1 + x)
Residuals:
   
Min     
1Q 
Median     
3Q    
Max                  
-25.400 -11.484 -3.779 
14.629  24.921   
Coefficients:
           
Estimate Std. Error t value
Pr(>|t|)                
(Intercept)
523.800     
8.474   61.81 
< 2e-16 ***         
x           
54.893     
2.350   23.36 2.26e-11
***                   
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 '
' 1 
Residual standard error: 17.59 on 12 degrees of freedom
Multiple R-squared: 0.9785,   
Adjusted R-squared: 0.9767    
F-statistic: 545.5 on 1 and 12 DF, p-value: 2.265e-11 
线性回归模型为y=523.800+54.893x,通过t检验和F检验。
(2)
> lm.sol<-lm(y~1+x+I(x^2));summary(lm.sol)
Call:
lm(formula = y ~ 1 + x + I(x^2))
Residuals:
    
Min      
1Q  
Median      
3Q     
Max                       
-10.6643
-5.6622 
-0.4655  
5.5000  10.6679     
Coefficients:
           
Estimate Std. Error t value
Pr(>|t|)                
(Intercept) 502.5560   
4.8500 103.619  < 2e-16
***     
x          
80.3857    
3.7861  21.232 2.81e-10 ***                
I(x^2)     
-4.2488    
0.6063  -7.008 2.25e-05 ***           
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 '
' 1 
Residual standard error: 7.858 on 11 degrees of freedom
Multiple R-squared: 0.9961,   
Adjusted R-squared: 0.9953    
F-statistic:
1391 on 2 and 11 DF,  p-value:
5.948e-14  
多项式回归模型为:
y=502.5560+80.3857x-4.2488x^2,通过t检验和F检验。
(3)作散点图和拟合曲线:
> plot(x,y)
> xfit<-seq(0,6,0.01)
> yfit<-predict(lm.sol,data.frame(x=xfit))
> lines(xfit,yfit)
http://s2/middle/681aaa554ba3f189c3811&690
Ex6.8
读入数据:
> cancer<-read.table("data",header=T)
> cancer
   x1 x2
x3 x4 x5 y  
1 70
64  5  1  1
1    
2 60
63  9  1  1
0    
3 70 65
11  1  1 0   
4 40 69
10  1  1 0   
5 40 63
58  1  1 0   
6 70
48  9  1  1
0    
7 70 48
11  1  1 0   
8 80
63  4  2  1
0    
9 60 63
14  2  1 0   
10 30 53
4  2  1 0   
11 80 43 12
2  1 0  
12 40 55
2  2  1 0   
13 60 66 25
2  1 1  
14 40 67 23
2  1 0  
15 20 61 19
3  1 0  
16 50 63
4  3  1 0   
17 50 66 16
0  1 0  
18 40 68 12
0  1 0  
19 80 41 12
0  1 1  
20 70 53
8  0  1 1   
21 60 37 13
1  1 0  
22 90 54 12
1  0 1  
23 50 52
8  1  0 1   
24 70 50
7  1  0 1   
25 20 65 21
1  0 0  
26 80 52 28
1  0 1  
27 60 70 13
1  0 0  
28 50 40 13
1  0 0  
29 70 36 22
2  0 0  
30 40 44 36
2  0 0  
31 30 54
9  2  0 0   
32 30 59 87
2  0 0  
33 40 69
5  3  0 0   
34 60 50 22
3  0 0  
35 80 62
4  3  0 0   
36 70 68 15
0  0 0  
37 30 39
4  0  0 0   
38 60 49 11
0  0 0  
39 80 64 10
0  0 1  
40 70 67 18
0  0 1  
> glm.sol<-glm(y~x1+x2+x3+x4+x5,family=binomial,data=cancer);summary(glm.sol)
Call:
glm(formula = y ~ x1 + x2 + x3 + x4 + x5, family = binomial,
   
data = cancer)   
Deviance Residuals:
    
Min       
1Q   
Median       
3Q      
Max                            
-1.71500
-0.66725 
-0.22254  
0.09936  
2.23936       
Coefficients:
           
Estimate Std. Error z value
Pr(>|z|)            
(Intercept) -7.01140  
4.47534 
-1.567  
0.1172       
x1         
0.09994   
0.04304  
2.322   0.0202 *                 
x2         
0.01415   
0.04697  
0.301  
0.7631                  
x3         
0.01749   
0.05458  
0.320  
0.7486                  
x4        
-1.08297   
0.58721 
-1.844   0.0651
.               
x5        
-0.61309   
0.96066 
-0.638  
0.5233                
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 '
' 1 
(Dispersion parameter for binomial family taken to be 1)
   
Null deviance: 44.987  on 39 
degrees of freedom     
Residual deviance: 28.392 on 34  degrees of
freedom  
AIC: 40.392
Number of Fisher Scoring iterations: 6
有的系数并不显著。
下面做逐步回归:
> glm.new<-step(glm.sol)
Start:
AIC=40.39 
y ~ x1 + x2 + x3 + x4 + x5
      
Df Deviance   
AIC         
- x3  
1   28.484
38.484     
- x2  
1   28.484
38.484     
- x5  
1   28.799
38.799     
<none>    
28.392 40.392     
- x4  
1   32.642
42.642     
- x1  
1   38.306
48.306     
Step:
AIC=38.48 
y ~ x1 + x2 + x4 + x5
      
Df Deviance   
AIC         
- x2  
1   28.564
36.564     
- x5  
1   28.993
36.993     
<none>    
28.484 38.484     
- x4  
1   32.705
40.705     
- x1  
1   38.478
46.478     
Step:
AIC=36.56 
y ~ x1 + x4 + x5
      
Df Deviance   
AIC         
- x5  
1   29.073
35.073     
<none>    
28.564 36.564     
- x4  
1   32.892
38.892     
- x1  
1   38.478
44.478     
Step:
AIC=35.07 
y ~ x1 + x4
      
Df Deviance   
AIC         
<none>    
29.073 35.073     
- x4  
1   33.535
37.535     
- x1  
1   39.131
43.131     
只留下x1和x4两个变量。
> summary(glm.new)
Call:
glm(formula = y ~ x1 + x4, family = binomial, data = cancer)
Deviance Residuals:
   
Min      
1Q  
Median      
3Q     
Max                       
-1.4825
-0.6617 
-0.1877  
0.1227  
2.2844       
Coefficients:
           
Estimate Std. Error z value
Pr(>|z|)            
(Intercept) -6.13755  
2.73844 
-2.241   0.0250
*      
x1         
0.09759   
0.04079  
2.393   0.0167 *                 
x4        
-1.12524   
0.60239 
-1.868   0.0618
.               
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 '
' 1 
(Dispersion parameter for binomial family taken to be 1)
   
Null deviance: 44.987  on 39 
degrees of freedom     
Residual deviance: 29.073 on 37  degrees of
freedom  
AIC: 35.073
Number of Fisher Scoring iterations: 6
回归方程为
P=exp(-6.13755+0.09759x1-1.12524x4)/(1+exp(-6.13755+0.09759x1-1.12524x4))
概率估计略。
Ex6.9
我表示不想做了...
我没弄明白nls()函数里所说的start即初始值怎么设置。好像可以随便设置,只要保证函数收敛即可??
							
		
						
		
		
		
		
		
		
							
		
				
		
				
	(1)
> x <- c(5.1, 3.5, 7.1, 6.2, 8.8, 7.8, 4.5, 5.6, 8.0, 6.4)
> y <- c(1907, 1287, 2700, 2373, 3260, 3000, 1947, 2273, 3113,2493)
> plot(x,y)
http://s14/middle/681aaa554ba006061415d&690
由此判断,Y和X有线性关系。
(2)
> lm.sol<-lm(y~1+x)
> summary(lm.sol)
Call:
lm(formula = y ~ 1 + x)
Residuals:
-128.591
Coefficients:
(Intercept)
x
---
Signif. codes:
Residual standard error: 96.42 on 8 degrees of freedom
Multiple R-squared: 0.9781,
F-statistic: 357.5 on 1 and 8 DF,
回归方程为 Y=140.95+364.18X
(3)
β1项很显著,但常数项β0不显著。
回归方程很显著。
(4)
> new <- data.frame(x=7)
> lm.pred<-predict(lm.sol,new,interval="prediction")
> lm.pred
1 2690.227 2454.971 2925.484
故Y(7)= 2690.227, [2454.971,2925.484]
Ex6.2
(1)
>pho<-data.frame(x1 <- c(0.4,0.4,3.1,0.6,4.7,1.7,9.4,10.1,11.6,12.6,10.9,23.1,23.1,21.6,23.1,1.9,26.8,29.9), x2 <- c(52,34,19,34,24,65,44,31,29,58,37,46,50,44,56,36,58,51), x3 <- c(158,163,37,157,59,123,46,117,173,112,111,114,134,73,168,143,202,124), y <- c(64,60,71,61,54,77,81,93,93,51,76,96,77,93,95,54,168,99))
> lm.sol<-lm(y~x1+x2+x3,data=pho)
> summary(lm.sol)
Call:
lm(formula = y ~ x1 + x2 + x3, data = pho)
Residuals:
-27.575 -11.160
Coefficients:
(Intercept)
x1
x2
x3
---
Signif. codes:
Residual standard error: 19.93 on 14 degrees of freedom
Multiple R-squared: 0.551,
F-statistic: 5.726 on 3 and 14 DF,
回归方程为 y=44.9290+1.8033x1-0.1337x2+0.1668x3
(2)
回归方程显著,但有些回归系数不显著。
(3)
> lm.step<-step(lm.sol)
Start:
y ~ x1 + x2 + x3
- x2
<none>
- x3
- x1
Step:
y ~ x1 + x3
<none>
- x3
- x1
> summary(lm.step)
Call:
lm(formula = y ~ x1 + x3, data = pho)
Residuals:
-29.713 -11.324
Coefficients:
(Intercept)
x1
x3
---
Signif. codes:
Residual standard error: 19.32 on 15 degrees of freedom
Multiple R-squared: 0.5481,
F-statistic: 9.095 on 2 and 15 DF,
x3仍不够显著。
再用drop1函数做逐步回归。
> drop1(lm.step)
Single term deletions
Model:
y ~ x1 + x3
<none>
x1
x3
可以考虑再去掉x3.
> lm.opt<-lm(y~x1,data=pho);summary(lm.opt)
Call:
lm(formula = y ~ x1, data = pho)
Residuals:
-31.486
Coefficients:
(Intercept)
x1
---
Signif. codes:
Residual standard error: 20.05 on 16 degrees of freedom
Multiple R-squared: 0.4808,
F-statistic: 14.82 on 1 and 16 DF,
皆显著。
Ex6.3
> x<-c(1,1,1,1,2,2,2,3,3,3,4,4,4,5,6,6,6,7,7,7,8,8,8,9,11,12,12,12)
> y<-c(0.6,1.6,0.5,1.2,2.0,1.3,2.5,2.2,2.4,1.2,3.5,4.1,5.1,5.7,3.4,9.7,8.6,4.0,5.5,10.5,17.5,13.4,4.5,30.4,12.4,13.4,26.2,7.4)
> plot(x,y)
> lm.sol<-lm(y~1+x)
> summary(lm.sol)
Call:
lm(formula = y ~ 1 + x)
Residuals:
-9.8413 -2.3369 -0.0214
Coefficients:
(Intercept)
x
---
Signif. codes:
Residual standard error: 5.168 on 26 degrees of freedom
Multiple R-squared: 0.5422,
F-statistic:
线性回归方程为 y=-1.4519+1.5578x,通过F 检验。 常数项参数未通过t 检验。
> abline(lm.sol)
http://s3/middle/681aaa554ba29a2d63852&690
> y.yes<-resid(lm.sol)
> y.fit<-predict(lm.sol)
> y.rst<-rstandard(lm.sol)
> plot(y.yes~y.fit)
http://s8/middle/681aaa554ba29c3200bf7&690
> plot(y.rst~y.fit)
http://s10/middle/681aaa554ba29c20b25b9&690
残差并非是等方差的。
修正模型,对相应变量Y做开方。
> lm.new<-update(lm.sol,sqrt(.)~.)
> summary(lm.new)
Call:
lm(formula = sqrt(y) ~ x)
Residuals:
-1.54255 -0.45280 -0.01177
Coefficients:
(Intercept)
x
---
Signif. codes:
Residual standard error: 0.7206 on 26 degrees of freedom
Multiple R-squared: 0.6806,
F-statistic: 55.41 on 1 and 26 DF,
此时所有参数和方程均通过检验。
对新模型做标准化残差图,情况有所改善,不过还是存在一个离群值。第24和第28个值存在问题。
http://s11/middle/681aaa554ba29e4bc2f9a&690
Ex6.4
> toothpaste<-data.frame( X1=c(-0.05, 0.25,0.60,0,0.20, 0.15,-0.15, 0.15,0.10,0.40,0.45,0.35,0.30, 0.50,0.50,0.40,-0.05,-0.05,-0.10,0.20,0.10,0.50,0.60,-0.05,0,0.05,0.55),X2=c(5.50,6.75,7.25,5.50,6.50,6.75,5.25,6.00,6.25,7.00,6.90,6.80,6.80,7.10,7.00,6.80,6.50,6.25,6.00,6.50,7.00,6.80,6.80,6.50,5.75,5.80,6.80),Y=c(7.38,8.51,9.52,7.50,8.28,8.75,7.10,8.00,8.15,9.10,8.86,8.90,8.87,9.26,9.00,8.75,7.95, 7.65,7.27,8.00,8.50,8.75,9.21,8.27,7.67,7.93,9.26))
> lm.sol<-lm(Y~X1+X2,data=toothpaste); summary(lm.sol)
Call:
lm(formula = Y ~ X1 + X2, data = toothpaste)
Residuals:
-0.37130 -0.10114
Coefficients:
(Intercept)
X1
X2
---
Signif. codes:
Residual standard error: 0.1767 on 24 degrees of freedom
Multiple R-squared: 0.9378,
F-statistic:
回归诊断:
> influence.measures(lm.sol)
Influence measures of
1
2
3
4
5
6
7
8
9
10 -0.07830
11
12 -0.03114
13 -0.09236 -0.03801
14 -0.02650
15
16 -0.00285 -0.06185
17
18
19
20
21
22 -0.17959 -0.39016
23
24 -0.54830 -0.74197
25
26
27
> source("Reg_Diag.R");Reg_Diag(lm.sol) #薛毅老师自己写的程序
1
2
3
4
5
6
7
8
9
10
11 -0.13874451
12
13
14
15 -0.13650951
16 -0.11097950
17 -0.03939381
18 -0.18593575
19 -0.33609591
20 -0.37130271
21 -0.02545527
22 -0.26374306
23
24
25
26
27
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
为什么两种方法检测结果不一样呢...不继续了
Ex6.5
> cement<-data.frame(X1=c( 7,
> xx<-cor(cement[1:4])
> kappa(xx,exact=T)
[1] 1376.881
大于1000,认为有严重的多重共线性。
> eigen(xx)
$values
[1] 2.235704035 1.576066070 0.186606149 0.001623746
$vectors
[1,] -0.4759552
[2,] -0.5638702 -0.4139315 -0.3144204 0.6417561
[3,]
[4,]
即2.235704035X1+1.576066070X2+0.186606149X3+0.001623746X4=0
可以忽略X4项,可以看出X1,X2,X3存在共线性。
删去X3和X4后:
> xx<-cor(cement[1:2])
> kappa(xx,exact=T)
[1] 1.59262
共线性消失了。
如果去掉X3呢?
> cement1<-cbind(cement$X1,cement$X2,cement$X4)
> xx<-cor(cement1)
> kappa(xx,exact=T)
[1] 77.25113
如果去掉X4呢?
> xx<-cor(cement[1:3])
> kappa(xx,exact=T)
[1] 11.12112
看起来去掉X3和X4是合理的。
Ex6.6
>infection<-data.frame(x1<-c(0,1,0,0,0,1,1,1),x2<-c(0,0,1,0,1,0,1,1),x3<-c(0,0,0,1,1,1,0,1),success<-c(0,0,23,8,28,0,11,1),fail<-c(9,0,3,32,30,2,87,17))
> infection$Ymat<-cbind(infection$success,infection$fail)
> glm.sol<-glm(Ymat~x1+x2+x3,family=binomial,data=infection)
> summary(glm.sol)
Call:
glm(formula = Ymat ~ x1 + x2 + x3, family = binomial, data = infection)
Deviance Residuals:
-2.56229
Coefficients:
(Intercept)
x1
x2
x3
---
Signif. codes:
(Dispersion parameter for binomial family taken to be 1)
Residual deviance: 10.997
AIC: 36.178
Number of Fisher Scoring iterations: 5
回归模型为
P=exp(-0.8207-3.2544x1+2.0299x2-1.0720x3)/(1+exp(-0.8207-3.2544x1+2.0299x2-1.0720x3))
Ex6.7
(1)
> x<-c(rep(0:6,rep(2,7)))
> y<-c(508.1,498.4,568.2,577.3,651.7,657.0,713.4,697.5,755.3,758.9,787.6,792.1,841.4,831.8)
> lm.sol<-lm(y~1+x)
> summary(lm.sol)
Call:
lm(formula = y ~ 1 + x)
Residuals:
-25.400 -11.484
Coefficients:
(Intercept)
x
---
Signif. codes:
Residual standard error: 17.59 on 12 degrees of freedom
Multiple R-squared: 0.9785,
F-statistic: 545.5 on 1 and 12 DF,
线性回归模型为y=523.800+54.893x,通过t检验和F检验。
(2)
> lm.sol<-lm(y~1+x+I(x^2));summary(lm.sol)
Call:
lm(formula = y ~ 1 + x + I(x^2))
Residuals:
-10.6643
Coefficients:
(Intercept) 502.5560
x
I(x^2)
---
Signif. codes:
Residual standard error: 7.858 on 11 degrees of freedom
Multiple R-squared: 0.9961,
F-statistic:
多项式回归模型为:
y=502.5560+80.3857x-4.2488x^2,通过t检验和F检验。
(3)作散点图和拟合曲线:
> plot(x,y)
> xfit<-seq(0,6,0.01)
> yfit<-predict(lm.sol,data.frame(x=xfit))
> lines(xfit,yfit)
http://s2/middle/681aaa554ba3f189c3811&690
Ex6.8
读入数据:
> cancer<-read.table("data",header=T)
> cancer
1
2
3
4
5
6
7
8
9
10 30 53
11 80 43 12
12 40 55
13 60 66 25
14 40 67 23
15 20 61 19
16 50 63
17 50 66 16
18 40 68 12
19 80 41 12
20 70 53
21 60 37 13
22 90 54 12
23 50 52
24 70 50
25 20 65 21
26 80 52 28
27 60 70 13
28 50 40 13
29 70 36 22
30 40 44 36
31 30 54
32 30 59 87
33 40 69
34 60 50 22
35 80 62
36 70 68 15
37 30 39
38 60 49 11
39 80 64 10
40 70 67 18
> glm.sol<-glm(y~x1+x2+x3+x4+x5,family=binomial,data=cancer);summary(glm.sol)
Call:
glm(formula = y ~ x1 + x2 + x3 + x4 + x5, family = binomial,
Deviance Residuals:
-1.71500
Coefficients:
(Intercept) -7.01140
x1
x2
x3
x4
x5
---
Signif. codes:
(Dispersion parameter for binomial family taken to be 1)
Residual deviance: 28.392
AIC: 40.392
Number of Fisher Scoring iterations: 6
有的系数并不显著。
下面做逐步回归:
> glm.new<-step(glm.sol)
Start:
y ~ x1 + x2 + x3 + x4 + x5
- x3
- x2
- x5
<none>
- x4
- x1
Step:
y ~ x1 + x2 + x4 + x5
- x2
- x5
<none>
- x4
- x1
Step:
y ~ x1 + x4 + x5
- x5
<none>
- x4
- x1
Step:
y ~ x1 + x4
<none>
- x4
- x1
只留下x1和x4两个变量。
> summary(glm.new)
Call:
glm(formula = y ~ x1 + x4, family = binomial, data = cancer)
Deviance Residuals:
-1.4825
Coefficients:
(Intercept) -6.13755
x1
x4
---
Signif. codes:
(Dispersion parameter for binomial family taken to be 1)
Residual deviance: 29.073
AIC: 35.073
Number of Fisher Scoring iterations: 6
回归方程为
P=exp(-6.13755+0.09759x1-1.12524x4)/(1+exp(-6.13755+0.09759x1-1.12524x4))
概率估计略。
Ex6.9
我表示不想做了...
我没弄明白nls()函数里所说的start即初始值怎么设置。好像可以随便设置,只要保证函数收敛即可??

 加载中…
加载中…