
Simple Linear Regression Analysis (Excel, R, SPSS, Stata, Minitab)

(2012-10-24 14:41:49)
Tags: simple linear regression, Excel, Minitab, R, SPSS

Simple Linear Regression

Regression with a Single Continuous Explanatory Variable

    1) Theoretical model (no residual term): y = a + bx

    2) Prediction model (the response is a computed value, no residual term): predicted Yi = a + bxi

       Observation model (the response is an observed value, with a residual term): observed yi = a + bxi + εi

[x: explanatory variable (also called a predictor or independent variable); y: outcome variable (also called a response or dependent variable); a: intercept, the predicted value of y when x = 0; b: slope, the predicted change in y for a 1-unit change in x; Yi: predicted (computed) value; xi, yi: observed values; εi: residual, εi = observed yi − predicted Yi]
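The notes do not say how a and b are estimated. A minimal sketch, assuming ordinary least squares (the usual choice) and a small made-up data set, shows how the intercept, slope, predicted values and residuals defined above relate to each other. The function name `fit_line` and the numbers are illustrative only.

```python
from statistics import mean


def fit_line(x, y):
    """Return (a, b) that minimise the sum of squared residuals of y = a + b*x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx            # slope: predicted change in y per unit change in x
    a = y_bar - b * x_bar    # intercept: predicted y when x = 0
    return a, b


# Hypothetical toy data, not taken from the article.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = fit_line(x, y)
predicted = [a + b * xi for xi in x]                     # Yi = a + b*xi
residuals = [yi - pi for yi, pi in zip(y, predicted)]    # ei = yi - Yi
print(f"a = {a:.3f}, b = {b:.3f}")
```

With an intercept in the model, the least-squares residuals always sum to zero, which is a quick sanity check on any hand calculation.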

We make the following three assumptions about the residuals εi:

The residuals are normally distributed with zero mean and variance σ² (spoken as 'sigma-squared'). This assumption is often written in shorthand as εi ~ N(0, σ²).

...........    [Normality]

The variance of the residuals is constant, whatever the value of x. This means that if we take a slice through the scatter plot of y versus x at any particular value of x, the y values have approximately the same variation as at any other value of x. If the variance is constant, we say the residuals are homoskedastic. Otherwise they are said to be heteroskedastic.

...........    [Constant variance]

The residuals are not correlated with one another, i.e. they are independent. Correlations might arise if some individuals contribute more than one observation (e.g. repeated measures) or if individuals are clustered in some way (e.g. in schools). If it is suspected that residuals are correlated, the regression model needs to be modified, e.g. to a multilevel model.

...........    [Independence / no correlation]

 

 

Important note: We cannot claim that there is a causal relationship between X and Y from such a simple model or, indeed, from any regression model applied to observational data. So when interpreting the slope it is better to avoid statements like 'a change in X leads to or causes an increase in Y'. Taking account of other factors would provide stronger evidence of a causal relationship if the original relationship did not change as additional predictors are included in the model.

 

Explained and unexplained variance and R-squared

All statistical models have a common form:

Response = Systematic part + Random part

 

where for a simple regression the systematic part is a + bxi and the random part is the residual εi.

The systematic part gives the average relationship between the response and the predictor(s), while the random part is what is left over (the unexplained part) after taking account of the included predictor(s).

 

The term random means 'allowed to vary'.

 

The residual is the difference between the actual and predicted value. In some cases there will be a close fit between the actual and fitted values. In other cases there may be a lot of ‘noise’ (e.g. if, for any given age, there is a wide range of hedonism scores). It is helpful to characterise this residual variability. To do so requires us to make some assumptions about the residuals (normality and homoskedasticity). Under these assumptions we can summarise the variability in a single statistic, the variance of the residuals σ2. We can think of the residual variance as the part of the variance in Y that is unexplained by X. The part of the variance in Y that is explained by X (the systematic part of the model) is called the explained variance in Y.

 

Another key summary statistic is the R-squared (R²) value, which gives the correspondence between the actual and fitted values on a scale between zero (no correspondence) and 1 (complete correspondence). R-squared can also be interpreted as the proportion of the total variance in Y that can be explained by variability in X.

 

In the case of simple regression, R-squared is the square of the Pearson correlation coefficient.
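To make the two readings of R-squared concrete, here is a small sketch on made-up data (an assumption, not the article's data). It computes R-squared as one minus the unexplained share of the variation in y, and separately the Pearson correlation coefficient, so the two can be compared.

```python
from math import sqrt
from statistics import mean

# Hypothetical toy data, not taken from the article.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

x_bar, y_bar = mean(x), mean(y)
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)    # total variation in y
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b = sxy / sxx
a = y_bar - b * x_bar
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))   # unexplained variation

r_squared = 1 - sse / syy              # explained share of the total variation
pearson_r = sxy / sqrt(sxx * syy)      # Pearson correlation of x and y
print(f"R^2 = {r_squared:.6f}, r^2 = {pearson_r ** 2:.6f}")
```

The two printed numbers agree up to floating-point rounding, which is the identity stated above for a single predictor.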

 

Hypothesis testing

H0: b = 0  vs.  Ha: b ≠ 0  (null and alternative hypotheses)

In the practice sections, we will generally use the Z-ratio (also called the t-ratio) to test significance rather than calculating confidence intervals.
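The t-ratio for the slope is the estimate divided by its standard error. The sketch below assumes the standard least-squares formulas (residual standard deviation s with n − 2 degrees of freedom, and SE(b) = s / sqrt(Sxx)); the function name and data are made up for illustration.

```python
from math import sqrt
from statistics import mean


def slope_t_ratio(x, y):
    """Return (b, se_b, t) for the least-squares slope of y on x."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = y_bar - b * x_bar
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    s = sqrt(sse / (n - 2))      # residual standard deviation
    se_b = s / sqrt(sxx)         # standard error of the slope
    return b, se_b, b / se_b


# Hypothetical toy data, not taken from the article.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b, se_b, t_ratio = slope_t_ratio(x, y)
print(f"b = {b:.3f}, SE(b) = {se_b:.4f}, t = {t_ratio:.2f}")
```

H0: b = 0 is rejected when |t| exceeds the critical value of the t distribution with n − 2 degrees of freedom, which has to be looked up in a table or computed with a statistics package.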

 

 

Model checking

To check assumptions about εi, we use the estimated residuals, which are the differences between the observed and predicted values of y: residual i = observed yi − predicted Yi.

 

We usually work with the standardized residuals ri, which we obtain by dividing each residual by its standard deviation.
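As a rough sketch of that standardization, the code below divides each residual by an estimate of its own standard deviation, s·sqrt(1 − h), where h is the leverage of the observation. This leverage-adjusted form is an assumption about which variant is meant (dividing by s alone is a simpler alternative), and the data are made up.

```python
from math import sqrt
from statistics import mean


def standardized_residuals(x, y):
    """Return (standardized residuals, leverages) for a simple regression of y on x."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    a = y_bar - b * x_bar
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    s = sqrt(sum(e * e for e in resid) / (n - 2))   # residual standard deviation
    # Leverage of each observation: h_i = 1/n + (x_i - x_bar)^2 / Sxx
    lev = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    std = [e / (s * sqrt(1 - h)) for e, h in zip(resid, lev)]
    return std, lev


# Hypothetical toy data, not taken from the article.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8, 15.0, 14.1, 16.0]

std_resid, leverage = standardized_residuals(x, y)
flagged = [i for i, r in enumerate(std_resid) if abs(r) > 2]   # candidate outliers
print([round(r, 2) for r in std_resid], flagged)
```

The cutoff of 2 used for `flagged` follows the ±2 rule of thumb discussed in the outlier section.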

 

Checking the normality assumption

We can check whether residuals are normally distributed by looking at a histogram or a normal probability plot of the standardized residuals. If the normality assumption holds, the points in a normal plot should lie on a straight line.

 

Checking the homoskedasticity assumption

To check that the variance of the residuals is fairly constant across the range of X, we can examine a plot of the standardized residuals against X and check that the vertical scatter of the residuals is roughly the same for different values of X with no 'funnelling'.

 

Outliers

We can also check for outliers using any of the above residual plots. An outlier is a point with a particularly large residual. We would expect approximately 95% of the standardized residuals to lie between −2 and +2.

 

Of major interest, however, is whether an outlier has undue influence on our results. For example, in simple regression, an outlier with very large values on X and Y could push up a positive slope. A straightforward way to judge the influence of an outlier is to refit the regression line after excluding it. If the results are very similar to those based on all observations, we would conclude that the outlier does not have undue influence. An observation's influence can also be measured by a statistic called Cook's D.
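Both ideas can be sketched in a few lines. The code below assumes the usual closed form for Cook's D in terms of the residual e, the leverage h and the number of coefficients p = 2, and then refits the line without the most influential point. The data are made up, with the last point deliberately placed far from the others.

```python
from statistics import mean


def fit_line(x, y):
    """Least-squares intercept and slope of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    return y_bar - b * x_bar, b


def cooks_distance(x, y):
    """Cook's D for every observation: D_i = e_i^2 / (p * s^2) * h_i / (1 - h_i)^2."""
    n, p = len(x), 2                      # p = number of coefficients (a and b)
    a, b = fit_line(x, y)
    x_bar = mean(x)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    s2 = sum(e * e for e in resid) / (n - p)
    dist = []
    for xi, e in zip(x, resid):
        h = 1 / n + (xi - x_bar) ** 2 / sxx   # leverage
        dist.append(e * e / (p * s2) * h / (1 - h) ** 2)
    return dist


# Hypothetical toy data; the last point is far out on both x and y.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 10.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 30.0]

d = cooks_distance(x, y)
worst = max(range(len(x)), key=lambda i: d[i])
slope_all = fit_line(x, y)[1]
slope_without = fit_line(x[:worst] + x[worst + 1:], y[:worst] + y[worst + 1:])[1]
print(f"most influential index = {worst}, slope {slope_all:.2f} -> {slope_without:.2f}")
```

A large change in the slope after dropping the flagged point is the 'undue influence' the text describes.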

 

 

 ----------------------------------Worked example-------------------------------

 

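 In simple linear regression the F test and the t test are both equivalent to the test of the correlation coefficient, so usually only three checks are needed: the correlation coefficient, the standard error, and the DW statistic (Durbin-Watson test; tables of its critical values are easy to find online).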
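The equivalence can be checked numerically. The sketch below, on made-up data, computes the F statistic from the ANOVA decomposition, the t statistic of the slope, and the t statistic derived from the correlation coefficient, t = r·sqrt(n − 2)/sqrt(1 − r²). These are the standard textbook formulas rather than anything taken from the software output in this post.

```python
from math import sqrt
from statistics import mean

# Hypothetical toy data, not taken from the article.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

x_bar, y_bar = mean(x), mean(y)
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b = sxy / sxx
ssr = b * sxy               # regression sum of squares (1 degree of freedom)
sse = syy - ssr             # residual sum of squares (n - 2 degrees of freedom)

f_stat = ssr / (sse / (n - 2))
t_slope = b / sqrt(sse / (n - 2) / sxx)
r = sxy / sqrt(sxx * syy)
t_corr = r * sqrt(n - 2) / sqrt(1 - r * r)

print(f"F = {f_stat:.3f}, t_slope^2 = {t_slope ** 2:.3f}, t_corr = {t_corr:.3f}")
```

F equals the square of the slope's t statistic, and the slope's t statistic equals the one computed from r, so the three tests always reach the same decision.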

 

Data (one long pair of columns in Excel; rearranged into five side-by-side y, x column pairs to save space. Read down each pair in turn: observations 1 to 10 are in the first pair, 11 to 20 in the second, and so on.)

    y     x        y     x        y     x        y     x        y     x
 49.5  25.8     70.4  37.4     90.3  48.6    111.3  59.6    130.2  70.4
 52.0  27.5     72.1  38.5     92.4  49.5    112.3  60.5    132.3  71.5
 54.4  28.7     74.4  39.6     94.1  50.8    114.3  61.6    134.3  72.5
 57.1  29.5     76.4  40.6     95.7  51.7    116.3  62.7    136.2  73.7
 58.4  30.8     78.5  41.9     98.4  52.8    118.3  63.8    139.0  74.8
 59.7  31.7     81.2  42.9    100.5  53.9    120.3  64.9    140.4  75.9
 62.4  33.0     82.3  44.0    102.2  55.0    123.1  66.0    142.2  77.0
 65.2  34.2     84.1  45.1    104.5  56.1    124.5  67.1    143.2  78.0
 66.4  35.2     86.3  46.2    106.2  57.3    126.3  68.2    146.4  79.2
 68.5  36.3     88.3  47.3    108.4  58.3    127.5  68.7    148.1  80.4
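For readers without any of the packages used in this post, the same fit can be sketched in plain Python. The 50 (x, y) pairs below are typed in from the table above, reading down each column pair; the code is only a cross-check, and its results should land close to the intercept (about 2.56), slope (about 1.81) and standard error (about 0.37) that the software output reports.

```python
from math import sqrt
from statistics import mean

# The article's data, read down each y, x column pair of the table.
y = [49.5, 52.0, 54.4, 57.1, 58.4, 59.7, 62.4, 65.2, 66.4, 68.5,
     70.4, 72.1, 74.4, 76.4, 78.5, 81.2, 82.3, 84.1, 86.3, 88.3,
     90.3, 92.4, 94.1, 95.7, 98.4, 100.5, 102.2, 104.5, 106.2, 108.4,
     111.3, 112.3, 114.3, 116.3, 118.3, 120.3, 123.1, 124.5, 126.3, 127.5,
     130.2, 132.3, 134.3, 136.2, 139.0, 140.4, 142.2, 143.2, 146.4, 148.1]
x = [25.8, 27.5, 28.7, 29.5, 30.8, 31.7, 33.0, 34.2, 35.2, 36.3,
     37.4, 38.5, 39.6, 40.6, 41.9, 42.9, 44.0, 45.1, 46.2, 47.3,
     48.6, 49.5, 50.8, 51.7, 52.8, 53.9, 55.0, 56.1, 57.3, 58.3,
     59.6, 60.5, 61.6, 62.7, 63.8, 64.9, 66.0, 67.1, 68.2, 68.7,
     70.4, 71.5, 72.5, 73.7, 74.8, 75.9, 77.0, 78.0, 79.2, 80.4]
n = len(x)

x_bar, y_bar = mean(x), mean(y)
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b = sxy / sxx                         # slope
a = y_bar - b * x_bar                 # intercept
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
s = sqrt(sse / (n - 2))               # standard error of the estimate
r = sxy / sqrt(sxx * syy)             # correlation coefficient ("Multiple R")
cv = s / y_bar                        # coefficient of variation used in section 1.2

print(f"n = {n}, a = {a:.4f}, b = {b:.5f}, s = {s:.4f}, R = {r:.5f}, cv = {cv:.4f}")
```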

 

 

 

1 Analysis with Excel 2007

On the Data tab, click Data Analysis; in the list of analysis tools that appears, choose Regression:

 

[Screenshot: Excel Regression dialog box]

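Y is the dependent or response variable and X is the independent, explanatory, or predictor variable; select or type in the cell ranges that contain them.

 

Labels: when checked, the first row is treated as variable names.

 

Constant is Zero: when checked, the fitted model has no intercept.

 

Confidence Level: the default is 95%.

 

Output options: specify where the output is written.

 

Residuals: specify which residual plots to output.

 

Normal Probability: when checked, a normal probability plot is produced.

 

After setting the options, click OK to get the corresponding tables (SUMMARY OUTPUT, RESIDUAL OUTPUT, PROBABILITY OUTPUT), the residual plots, and so on.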

 

-------------------------------


------------------------------------

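1.1 Correlation coefficient test

The correlation coefficient (Multiple R) is very close to 1: R = 0.99992 > R(0.05, 48) = 0.27871, so the test is passed. [From the table of critical values of the correlation coefficient, at significance level α = 0.05 and residual degrees of freedom df = 48, the critical value is R = 0.27871.]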
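The tabulated critical value can be reproduced from a t table, because the test on r is a t test in disguise: solving t = r·sqrt(df)/sqrt(1 − r²) for r gives r = t/sqrt(t² + df). The sketch below assumes the two-sided 5% critical value of t with 48 degrees of freedom is about 2.0106 (taken from a t table, not computed here); the function name is made up.

```python
from math import sqrt


def critical_r(t_crit, df):
    """Critical correlation coefficient implied by a critical t value with df degrees of freedom."""
    return t_crit / sqrt(t_crit ** 2 + df)


T_CRIT_48 = 2.0106   # assumed: two-sided 5% critical value of t with 48 df, from a t table
r_crit = critical_r(T_CRIT_48, 48)
r_observed = 0.99992   # Multiple R reported by Excel in the article

print(f"critical r = {r_crit:.5f}, test passed: {r_observed > r_crit}")
```

The result agrees with the 0.27871 read from the correlation table to about four decimal places.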

1.2 Standard error test

s = 0.3689, which gives the coefficient of variation cv = s / (mean of y) = 0.0037 < 10% to 15% (0.10 to 0.15). The test is passed.

1.3 DW test

The computed statistic is D = 2.02. From the table, dU = 1.59 and dL = 1.50. Since dU < D < (4 − dU), there is no serial correlation.
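The DW statistic itself is simple to compute from the residuals taken in observation order: the sum of squared successive differences divided by the sum of squared residuals. The sketch below uses the standard definition and the usual decision zones; the function names are made up, and the bounds dL and dU still have to come from a Durbin-Watson table.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared successive differences over sum of squares."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals)))
    den = sum(e * e for e in residuals)
    return num / den


def dw_decision(d, d_lower, d_upper):
    """Classify a DW statistic using the tabulated bounds dL and dU."""
    if d < d_lower:
        return "positive autocorrelation"
    if d > 4 - d_lower:
        return "negative autocorrelation"
    if d_upper < d < 4 - d_upper:
        return "no autocorrelation"
    return "inconclusive"


# Strictly alternating residuals give a statistic well above 2 (negative autocorrelation).
print(durbin_watson([1.0, -1.0, 1.0, -1.0]))
# The article's values: D = 2.02 with dL = 1.50 and dU = 1.59.
print(dw_decision(2.02, 1.50, 1.59))
```

Values near 2 indicate no first-order serial correlation, values near 0 positive correlation, and values near 4 negative correlation.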

 



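1.4 Residual plots

Residual: the difference between an observed value (y) and its corresponding fitted value (ŷ). Residuals are especially useful in regression and ANOVA because they indicate how well the model accounts for the variation in the observed data.

Standardized residual: helps detect outliers. A standardized residual equals the residual ei divided by an estimate of its standard deviation. Standardized residuals greater than 2 or less than −2 are usually considered large. They are useful because raw residuals have nonconstant variance and are therefore poor indicators of outliers: the variance of a raw residual is σ²(1 − hi), where hi is the leverage of observation i, so a residual whose X value is far from the mean of X (high leverage) has a smaller variance than one whose X value is close to the mean. Standardizing removes this nonconstant variance, so all standardized residuals have the same standard deviation. Standardized residuals are also called internally Studentized residuals.

Studentized deleted residual: helps detect outliers. It is computed by dividing an observation's deleted residual by an estimate of its standard deviation. The deleted residual di is the difference between yi and its fitted value from the model estimated without the i-th observation. The observation is left out to see how the model behaves without this potential outlier. If an observation's Studentized deleted residual is large (absolute value greater than 2), it may be an outlier in the data. Each Studentized deleted residual follows a t distribution with (n − 1 − p) degrees of freedom, where p is the number of terms in the regression model. Studentized deleted residuals are also called externally Studentized residuals or deleted t residuals.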
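The deleted residual does not require refitting the model n times. A minimal sketch, assuming the standard leverage identities (deleted residual = e/(1 − h), and the residual sum of squares without observation i = SSE − e²/(1 − h)) and taking p = 2 for intercept plus slope, is shown below on made-up data.

```python
from math import sqrt
from statistics import mean


def fit_line(x, y):
    """Least-squares intercept and slope of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    return y_bar - b * x_bar, b


def deleted_t_residuals(x, y, p=2):
    """Return (deleted residuals, Studentized deleted residuals); p counts a and b."""
    n = len(x)
    a, b = fit_line(x, y)
    x_bar = mean(x)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    sse = sum(e * e for e in resid)
    deleted, t_deleted = [], []
    for xi, e in zip(x, resid):
        h = 1 / n + (xi - x_bar) ** 2 / sxx          # leverage
        deleted.append(e / (1 - h))                  # y_i minus its fit without point i
        s2_i = (sse - e * e / (1 - h)) / (n - p - 1) # residual variance without point i
        t_deleted.append(e / sqrt(s2_i * (1 - h)))
    return deleted, t_deleted


# Hypothetical toy data, not taken from the article.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 13.5]

deleted, t_deleted = deleted_t_residuals(x, y)
print([round(t, 2) for t in t_deleted])
```

Each value in `t_deleted` would be compared with a t distribution on n − 1 − p degrees of freedom, or with the ±2 rule of thumb given above.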

 

---------------------The plot below was drawn by hand; the automatically generated plot was unreadable, possibly because of a version problem---------------------

[Figure: hand-drawn residual plot (image not available)]

--------------------------------------------------------------------------------------------------

 

2 Analysis with Minitab

The data can be copied straight from the Excel sheet into a Minitab worksheet. Then open the Stat menu and choose Regression; the first item in the list is simple linear regression. Selecting it brings up the dialog shown below:

[Screenshot: Minitab Regression dialog box]

 

 

 

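Options: contains the available statistics.

Graphs: three kinds of residuals can be plotted: regular, standardized, and deleted.

After making the selections, click OK. The output is reproduced below (in the Session window, right-click and choose "Send Section to Microsoft Word"). The results agree with those from Excel. The three kinds of residual plots follow the numerical output.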

 

Regression Analysis: y versus x

The regression equation is
y = 2.55 + 1.81 x

Predictor     Coef  SE Coef       T      P
Constant    2.5536   0.1828   13.97  0.000
x          1.81460  0.00328  552.54  0.000

S = 0.368925   R-Sq = 100.0%   R-Sq(adj) = 100.0%

Analysis of Variance

Source          DF     SS     MS          F      P
Regression       1  41553  41553  305301.04  0.000
Residual Error  48      7      0
Total           49  41560

Unusual Observations

Obs     x        y      Fit  SE Fit  Residual  St Resid
  4  29.5   57.100   56.084   0.094     1.016     2.85R
 16  42.9   81.200   80.400   0.062     0.800     2.20R
 37  66.0  123.100  122.317   0.067     0.783     2.16R
 48  78.0  143.187  144.093   0.096    -0.906    -2.54R

R denotes an observation with a large standardized residual.

Durbin-Watson statistic = 2.02262

 

Below are the three kinds of residual plots; in practice, one is enough.

[Figures: Minitab residual plots (images not available)]


 

-------------------------------------------------------------



 

 


------------------------------------------------------

 

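##  Plot 1 (Residuals vs Fitted): the y-axis shows the residuals and the x-axis the fitted values. Ideally the plot looks like a clear night sky: the residuals are scattered like stars, with no pattern.

##  Plot 2 (Normal Q-Q): if the errors are normally distributed, the points fall close to a straight line. An S shape or a banana shape means a different model should be fitted. The function qqnorm() can also be used for this check.

##  Plot 3 (Scale-Location): the same as the first plot but with a different scale on the y-axis. If there is a problem, for example the variance increasing with the mean, the points fall inside a triangle, with the residuals spreading out as the fitted values increase.

##  Plot 4 (Residuals vs Leverage): shows the standardized residuals as a function of leverage, together with Cook's distance for each observation of the response variable. The labelled points in this plot highlight the y values that have the largest influence on the parameter estimates.

## Definition of leverage: http://blog.sina.com.cn/s/blog_b5c8908c0101ck2x.html

## durbinWatsonTest() is in the car package, which in turn needs the MASS and nnet packages.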

 

-----------------------------------------------------

 


 

-----------------------------------------------Output follows (the values of interest were marked in red in the original)----------------------------------

 

Variables Entered/Removed(a)

Model  Variables Entered  Variables Removed  Method
1      x(b)               .                  Enter

a. Dependent Variable: y

b. All requested variables entered.

 

 

Model Summary(b)

Model  R         R Square  Adjusted R Square  Std. Error of the Estimate  Durbin-Watson
1      1.000(a)  1.000     1.000              .36869                      1.984

a. Predictors: (Constant), x

b. Dependent Variable: y

 

ANOVA(a)

Model          Sum of Squares  df  Mean Square  F           Sig.
1  Regression       41548.431   1    41548.431  305662.569  .000(b)
   Residual             6.525  48         .136
   Total            41554.955  49

a. Dependent Variable: y

b. Predictors: (Constant), x

 

Coefficients(a)

                Unstandardized Coefficients  Standardized Coefficients
Model           B       Std. Error           Beta                       t        Sig.
1  (Constant)   2.562   .183                                            14.025   .000
   x            1.814   .003                 1.000                      552.868  .000

a. Dependent Variable: y

 

Residuals Statistics(a)

                      Minimum  Maximum   Mean     Std. Deviation  N
Predicted Value       49.3756  148.4472  99.3360  29.11919        50
Residual              -.89243  1.01077   .00000   .36490          50
Std. Predicted Value  -1.716   1.687     .000     1.000           50
Std. Residual         -2.421   2.742     .000     .990            50

a. Dependent Variable: y

 


-------------------------------------The above is the SPSS output------------------------------------------

 

 
