Simple Linear Regression Analysis (Excel, R, SPSS, Stata, Minitab)

Simple Linear Regression
Regression with a Single Continuous Explanatory Variable
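The model is $y_i = a + b x_i + \varepsilon_i$, with the following notation:
x: explanatory variable (also called a predictor or independent variable)
y: outcome variable (also called a response or dependent variable)
a: intercept, the predicted value of y when x = 0
b: slope, the predicted change in y for a 1-unit change in x
$\hat{y}_i$: predicted (fitted) value; $x_i$ and $y_i$: observed values
$\varepsilon_i$: residual, $\varepsilon_i = y_i - \hat{y}_i$ (observed minus predicted)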
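As a rough illustration of the notation above, here is a minimal Python sketch that estimates a and b by ordinary least squares and then forms the predicted values and residuals. It uses only the first ten (x, y) pairs of the data table given later in this post, so its numbers will not match the full 50-point outputs; the helper name `fit_line` is made up for this example.

```python
# First ten (x, y) pairs from the data table in this post (a subset, for brevity).
x = [25.8, 27.5, 28.7, 29.5, 30.8, 31.7, 33.0, 34.2, 35.2, 36.3]
y = [49.5, 52.0, 54.4, 57.1, 58.4, 59.7, 62.4, 65.2, 66.4, 68.5]


def fit_line(xs, ys):
    """Return (a, b): least-squares intercept and slope of y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((xi - x_bar) ** 2 for xi in xs)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    b = sxy / sxx            # slope
    a = y_bar - b * x_bar    # intercept
    return a, b


a, b = fit_line(x, y)
predicted = [a + b * xi for xi in x]                    # fitted values
residuals = [yi - pi for yi, pi in zip(y, predicted)]   # observed minus predicted

print(f"a = {a:.3f}, b = {b:.3f}")
```

With an intercept in the model, the residuals sum to zero up to rounding error, which is a quick sanity check on any fit.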
We make the following three assumptions about the residuals $\varepsilon_i$:
(1) The residuals are normally distributed with zero mean and variance $\sigma^2$ (spoken as ‘sigma-squared’). This assumption is often written in shorthand as $\varepsilon_i \sim N(0, \sigma^2)$.
(2) The variance of the residuals is constant, whatever the value of x. This means that if we take a slice through the scatter plot of y versus x at any particular value of x, the y values have approximately the same variation as at any other value of x. If the variance is constant, we say the residuals are homoskedastic. Otherwise they are said to be heteroskedastic.
(3) The residuals are not correlated with one another, i.e. they are independent. Correlations might arise if some individuals contribute more than one observation (e.g. repeated measures) or if individuals are clustered in some way (e.g. in schools). If it is suspected that residuals are correlated, the regression model needs to be modified, e.g. to a multilevel model.
Important note: We cannot claim that there is a causal relationship between X and Y from such a simple model or, indeed, from any regression model applied to observational data. So when interpreting the slope it is better to avoid statements like ‘a change in X leads to or causes an increase in Y’. Taking account of other factors would provide stronger evidence of a causal relationship if the original relationship did not change as additional predictors are included in the model.
Explained and unexplained variance and R-squared
All statistical models have a common form:
Response = Systematic part + Random part
where for a simple regression the systematic part is $a + b x_i$ and the random part is the residual $\varepsilon_i$.
The systematic part gives the average relationship between the response and the predictor(s), while the random part is what is left over (the unexplained part) after taking account of the included predictor(s). The term random means ‘allowed to vary’.
The residual is the difference between the actual and predicted value. In some cases there will be a close fit between the actual and fitted values. In other cases there may be a lot of ‘noise’ (e.g. if, for any given age, there is a wide range of hedonism scores). It is helpful to characterise this residual variability. To do so requires us to make some assumptions about the residuals (normality and homoskedasticity). Under these assumptions we can summarise the variability in a single statistic, the variance of the residuals $\sigma^2$. We can think of the residual variance as the part of the variance in Y that is unexplained by X. The part of the variance in Y that is explained by X (the systematic part of the model) is called the explained variance in Y.
Another key summary statistic is the R-squared ($R^2$) value, which gives the correspondence between the actual and fitted values on a scale between zero (no correspondence) and 1 (complete correspondence). R-squared can also be interpreted as the proportion of the total variance in Y that can be explained by variability in X.
In the case of simple regression, R-squared is the square of the Pearson correlation coefficient.
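To make the explained/unexplained split concrete, the following Python sketch computes the total, explained, and residual sums of squares, derives R-squared from them, and compares it with the squared Pearson correlation. It again uses only the first ten pairs of the data table in this post, and the variable names (`sst`, `ssr`, `sse`) are just labels chosen for this example.

```python
import math

# First ten (x, y) pairs from the data table in this post (a subset, for brevity).
x = [25.8, 27.5, 28.7, 29.5, 30.8, 31.7, 33.0, 34.2, 35.2, 36.3]
y = [49.5, 52.0, 54.4, 57.1, 58.4, 59.7, 62.4, 65.2, 66.4, 68.5]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b = sxy / sxx
a = y_bar - b * x_bar
fitted = [a + b * xi for xi in x]

sst = sum((yi - y_bar) ** 2 for yi in y)                # total variation in y
ssr = sum((fi - y_bar) ** 2 for fi in fitted)           # explained by x
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # unexplained (residual)

r_squared = 1 - sse / sst
pearson_r = sxy / math.sqrt(sxx * sst)

print(f"R-squared = {r_squared:.5f}, r squared = {pearson_r ** 2:.5f}")
```

The two printed values should agree to rounding error, and `ssr + sse` should reproduce `sst`.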
Hypothesis testing
H0: b = 0 (no linear relationship between x and y)
In the practice sections, we will generally use the Z-ratio (also called the t-ratio), i.e. the estimated slope divided by its standard error, to test significance rather than calculating confidence intervals.
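The t-ratio for the slope can be sketched in a few lines of Python. The formulas used here are the standard ones for simple regression (residual standard error with n - 2 degrees of freedom, and slope standard error s / sqrt(Sxx)); the data are the first ten pairs of the table in this post, so the value differs from the full-sample t reported by SPSS further down.

```python
import math

# First ten (x, y) pairs from the data table in this post (a subset, for brevity).
x = [25.8, 27.5, 28.7, 29.5, 30.8, 31.7, 33.0, 34.2, 35.2, 36.3]
y = [49.5, 52.0, 54.4, 57.1, 58.4, 59.7, 62.4, 65.2, 66.4, 68.5]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b = sxy / sxx
a = y_bar - b * x_bar

sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
s = math.sqrt(sse / (n - 2))      # residual standard error
se_b = s / math.sqrt(sxx)         # standard error of the slope
t_ratio = b / se_b                # test statistic for H0: b = 0

# In simple regression the ANOVA F statistic is the square of this t-ratio.
ssr = sum(((a + b * xi) - y_bar) ** 2 for xi in x)
f_stat = ssr / (sse / (n - 2))

print(f"t = {t_ratio:.2f}, t^2 = {t_ratio ** 2:.2f}, F = {f_stat:.2f}")
```

The same relationship is visible in the SPSS output below: t = 552.868 for x, and 552.868 squared is about 305663, which matches the reported F = 305662.569 up to rounding.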
Model checking
To check assumptions about $\varepsilon_i$, we use the estimated residuals, which are the differences between the observed and predicted values of y: $\text{residual}_i = y_i - \hat{y}_i$.
We usually work with the standardized residuals $r_i$, which we obtain by dividing each residual by the standard deviation of the residuals.
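A minimal sketch of this standardization in Python is shown below. It assumes the simplest version, dividing every residual by the same residual standard deviation s (with n - 2 degrees of freedom); software may instead use a per-observation standard deviation. The data are the first ten pairs of the table in this post.

```python
import math

# First ten (x, y) pairs from the data table in this post (a subset, for brevity).
x = [25.8, 27.5, 28.7, 29.5, 30.8, 31.7, 33.0, 34.2, 35.2, 36.3]
y = [49.5, 52.0, 54.4, 57.1, 58.4, 59.7, 62.4, 65.2, 66.4, 68.5]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b = sxy / sxx
a = y_bar - b * x_bar

# Estimated residuals: observed minus predicted.
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

# Residual standard deviation, using n - 2 degrees of freedom.
s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))

# Standardized residuals: each residual divided by s.
standardized = [e / s for e in residuals]

# Positions (0-based) of observations outside the -2..+2 band.
flagged = [i for i, r in enumerate(standardized) if abs(r) > 2]

for i, r in enumerate(standardized):
    print(i, round(r, 3))
print("outside +/-2:", flagged)
```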
Checking the normality assumption
We can check whether residuals are normally distributed by looking at a histogram or a normal probability plot of the standardized residuals. If the normality assumption holds, the points in a normal plot should lie on a straight line.
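The coordinates of a normal probability plot can be computed without any plotting library. The sketch below pairs the sorted standardized residuals with standard normal quantiles and uses their correlation as a rough measure of straightness. The plotting positions (i - 0.5) / n are an assumption made for this example; different packages use slightly different positions. The data are the first ten pairs of the table in this post.

```python
import math
from statistics import NormalDist

# First ten (x, y) pairs from the data table in this post (a subset, for brevity).
x = [25.8, 27.5, 28.7, 29.5, 30.8, 31.7, 33.0, 34.2, 35.2, 36.3]
y = [49.5, 52.0, 54.4, 57.1, 58.4, 59.7, 62.4, 65.2, 66.4, 68.5]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b = sxy / sxx
a = y_bar - b * x_bar
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))

# Vertical axis of the plot: standardized residuals in ascending order.
ordered = sorted(e / s for e in residuals)

# Horizontal axis: standard normal quantiles at positions (i - 0.5) / n.
normal_scores = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]


def correlation(u, v):
    """Pearson correlation of two equal-length sequences."""
    u_bar = sum(u) / len(u)
    v_bar = sum(v) / len(v)
    suv = sum((ui - u_bar) * (vi - v_bar) for ui, vi in zip(u, v))
    suu = sum((ui - u_bar) ** 2 for ui in u)
    svv = sum((vi - v_bar) ** 2 for vi in v)
    return suv / math.sqrt(suu * svv)


# Values near 1 mean the points lie close to a straight line.
straightness = correlation(normal_scores, ordered)
print(f"normal-plot correlation = {straightness:.3f}")
```

Plotting `normal_scores` against `ordered` gives the normal plot itself; the correlation is only a numeric summary and does not replace looking at the plot.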
Checking the homoskedasticity assumption
To check that the variance of the residuals is fairly constant across the range of X, we can examine a plot of the standardized residuals against X and check that the vertical scatter of the residuals is roughly the same for different values of X with no ‘funnelling’.
Outliers
We can also check for outliers using any of the above residual plots. An outlier is a point with a particularly large residual. We would expect approximately 95% of the standardized residuals to lie between -2 and +2.
Of major interest, however, is whether an outlier has undue influence on our results. For example, in simple regression, an outlier with very large values on X and Y could push up a positive slope. A straightforward way to judge the influence of an outlier is to refit the regression line after excluding it. If the results are very similar to those based on all observations, we would conclude that the outlier does not have undue influence. An observation’s influence can also be measured by a statistic called Cook’s D, which quantifies how much influence each such point has.
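Both ideas in the paragraph above, refitting without a point and Cook's D, can be sketched together. The Python example below computes Cook's D from the usual closed form based on leverage, and cross-checks it by actually refitting the line with each observation left out. The closed form and the choice p = 2 (intercept and slope) are standard for simple regression but are stated here as assumptions of the sketch; `fit_line` and `cooks_d_by_refit` are names invented for this example, and the data are the first ten pairs of the table in this post.

```python
# First ten (x, y) pairs from the data table in this post (a subset, for brevity).
x = [25.8, 27.5, 28.7, 29.5, 30.8, 31.7, 33.0, 34.2, 35.2, 36.3]
y = [49.5, 52.0, 54.4, 57.1, 58.4, 59.7, 62.4, 65.2, 66.4, 68.5]


def fit_line(xs, ys):
    """Return (a, b): least-squares intercept and slope of y = a + b*x."""
    m = len(xs)
    xm = sum(xs) / m
    ym = sum(ys) / m
    sxx_ = sum((xi - xm) ** 2 for xi in xs)
    sxy_ = sum((xi - xm) * (yi - ym) for xi, yi in zip(xs, ys))
    slope = sxy_ / sxx_
    return ym - slope * xm, slope


n = len(x)
p = 2  # number of estimated coefficients: intercept and slope
a, b = fit_line(x, y)
fitted = [a + b * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]
s2 = sum(e ** 2 for e in residuals) / (n - p)   # residual variance estimate

x_bar = sum(x) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
# Leverage: large for x values far from the mean of x.
leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]

# Closed form: D_i = e_i^2 / (p * s^2) * h_i / (1 - h_i)^2
cooks_d = [e ** 2 / (p * s2) * h / (1 - h) ** 2
           for e, h in zip(residuals, leverage)]


def cooks_d_by_refit(i):
    """Cook's D for observation i, computed by refitting without it."""
    a_i, b_i = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
    shift = sum((fitted[j] - (a_i + b_i * x[j])) ** 2 for j in range(n))
    return shift / (p * s2)


for i in range(n):
    print(i, round(cooks_d[i], 4), round(cooks_d_by_refit(i), 4))
```

The two columns printed for each observation should agree, which shows that Cook's D is a summary of how far all fitted values move when that one point is dropped.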
Data (a single long pair of y and x columns in Excel; rearranged here into five side-by-side pairs to save space)
| y | x | y | x | y | x | y | x | y | x |
|---|---|---|---|---|---|---|---|---|---|
| 49.5 | 25.8 | 70.4 | 37.4 | 90.3 | 48.6 | 111.3 | 59.6 | 130.2 | 70.4 |
| 52.0 | 27.5 | 72.1 | 38.5 | 92.4 | 49.5 | 112.3 | 60.5 | 132.3 | 71.5 |
| 54.4 | 28.7 | 74.4 | 39.6 | 94.1 | 50.8 | 114.3 | 61.6 | 134.3 | 72.5 |
| 57.1 | 29.5 | 76.4 | 40.6 | 95.7 | 51.7 | 116.3 | 62.7 | 136.2 | 73.7 |
| 58.4 | 30.8 | 78.5 | 41.9 | 98.4 | 52.8 | 118.3 | 63.8 | 139.0 | 74.8 |
| 59.7 | 31.7 | 81.2 | 42.9 | 100.5 | 53.9 | 120.3 | 64.9 | 140.4 | 75.9 |
| 62.4 | 33.0 | 82.3 | 44.0 | 102.2 | 55.0 | 123.1 | 66.0 | 142.2 | 77.0 |
| 65.2 | 34.2 | 84.1 | 45.1 | 104.5 | 56.1 | 124.5 | 67.1 | 143.2 | 78.0 |
| 66.4 | 35.2 | 86.3 | 46.2 | 106.2 | 57.3 | 126.3 | 68.2 | 146.4 | 79.2 |
| 68.5 | 36.3 | 88.3 | 47.3 | 108.4 | 58.3 | 127.5 | 68.7 | 148.1 | 80.4 |
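1 Analysis with Excel 2007
In the Data menu, find the Data Analysis option. Clicking it opens the Analysis Tools list, which contains the Regression tool:
[Screenshot: Excel Regression dialog]
Y is the dependent or response variable and X is the independent, explanatory, or predictor variable; select or type in the ranges that contain them.
Labels: when checked, the first row is treated as variable names.
Constant is Zero: when checked, the fitted model has no intercept.
Confidence Level: the default is 95%.
Output options: specify where the output is written.
Residuals: specify which kinds of residual plots to output.
Normal Probability: when checked, a normal probability plot is produced.
After setting the options, click OK to obtain the corresponding tables (SUMMARY OUTPUT, RESIDUAL OUTPUT, PROBABILITY OUTPUT), the residual plots, and so on.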
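1.1 Correlation coefficient test
The correlation coefficient (Multiple R) is very close to 1: R = 0.99992 > R(0.05, 48) = 0.27871, so the test is passed. [From the table of critical values for the correlation coefficient, at significance level α = 0.05 and residual degrees of freedom df = 48, R = 0.27871.]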
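The tabulated critical value can be reproduced from a t critical value through the standard relation r = t / sqrt(t^2 + df). The Python sketch below does this; the value 2.0106 for the two-sided 5% critical t with 48 degrees of freedom is taken from a t-table and hard-coded as an assumption, and the function names are invented for this example.

```python
import math


def critical_r(t_crit, df):
    """Critical correlation coefficient implied by a critical t value."""
    return t_crit / math.sqrt(t_crit ** 2 + df)


def t_from_r(r, df):
    """t statistic equivalent to an observed correlation coefficient."""
    return r * math.sqrt(df / (1 - r ** 2))


df = 48
t_crit = 2.0106   # assumption: two-sided 5% critical t for 48 df, from a t-table

r_crit = critical_r(t_crit, df)
r_obs = 0.99992   # Multiple R reported in the Excel output
t_obs = t_from_r(r_obs, df)

print(f"critical r = {r_crit:.5f}")
print(f"observed r = {r_obs} corresponds to t = {t_obs:.1f}")
```

The computed critical r comes out very close to the 0.27871 quoted from the table, and the observed correlation corresponds to a t value far above the critical one. Because R is rounded to five decimals, the t recovered here only approximates the t reported by the packages.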
1.2 Standard error test
s = 0.3689, from which the coefficient of variation cv is obtained:
cv = [s
1.3 Durbin-Watson (DW) test
The computed statistic is D = 2.02. From the table, dU = 1.59 and dL = 1.50. Since dU < D < (4 - dU), there is no serial correlation.
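For reference, the Durbin-Watson statistic is the sum of squared differences between successive residuals divided by the sum of squared residuals. The Python sketch below implements that definition together with the usual five-zone decision rule; the function names and the wording of the returned labels are choices made for this example, while the bounds dL = 1.50 and dU = 1.59 and the value D = 2.02 are the ones quoted above.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals listed in observation order."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den


def dw_decision(d, d_l, d_u):
    """Classify d using tabulated lower and upper bounds d_l and d_u."""
    if d < d_l:
        return "positive serial correlation"
    if d <= d_u:
        return "inconclusive"
    if d < 4 - d_u:
        return "no serial correlation"
    if d <= 4 - d_l:
        return "inconclusive"
    return "negative serial correlation"


# Toy residuals that alternate in sign, to show the statistic moving above 2.
toy = [1.0, -1.0, 1.0, -1.0]
print(durbin_watson(toy))

# The value reported above, with the tabulated bounds from the text.
print(dw_decision(2.02, 1.50, 1.59))
```

The statistic always lies between 0 and 4; values near 2 indicate no first-order serial correlation, values near 0 positive correlation, and values near 4 negative correlation.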
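1.4 Residual plots
Residual: the difference between an observed value (y) and its fitted value (ŷ). Residuals are especially useful in regression and ANOVA because they indicate how much of the variation in the observed data the model accounts for.
Standardized residual: helps detect outliers. It equals the residual e_i divided by an estimate of its standard deviation. Standardized residuals greater than 2 or less than -2 are usually considered large. They are useful because raw residuals do not all have the same variance even when the errors do, so raw residuals can be poor indicators of outliers: residuals whose X values lie far from the mean of X have smaller variance than residuals whose X values lie close to it, since Var(e_i) = σ²(1 - h_i) and the leverage h_i grows with the distance from the mean. Standardizing removes this inequality, so that all standardized residuals have the same standard deviation. Standardized residuals are also called internally studentized residuals.
Deleted t residual (studentized deleted residual): helps detect outliers. It is computed by dividing an observation's deleted residual by an estimate of its standard deviation. The deleted residual d_i is the difference between y_i and its fitted value from a model that is fitted without the i-th observation. The observation is left out to see how the model behaves without this potential outlier. If an observation's deleted t residual is large (absolute value greater than 2), it may be an outlier in the data. Each deleted t residual follows a t distribution with (n - 1 - p) degrees of freedom, where p is the number of terms in the regression model. Deleted t residuals are also called externally studentized residuals.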
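The three kinds of residuals described above can be computed side by side. The Python sketch below does so under the usual simple-regression formulas, taking p = 2 (constant and slope) as the number of terms; that reading of p and all variable names are assumptions of this example. The data are the first ten pairs of the table in this post.

```python
import math

# First ten (x, y) pairs from the data table in this post (a subset, for brevity).
x = [25.8, 27.5, 28.7, 29.5, 30.8, 31.7, 33.0, 34.2, 35.2, 36.3]
y = [49.5, 52.0, 54.4, 57.1, 58.4, 59.7, 62.4, 65.2, 66.4, 68.5]

n = len(x)
p = 2  # terms in the model: constant and slope (assumed reading of p)
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b = sxy / sxx
a = y_bar - b * x_bar

residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
sse = sum(e ** 2 for e in residuals)
s = math.sqrt(sse / (n - p))

# Standardized (internally studentized): e_i / (s * sqrt(1 - h_i)).
internal = [e / (s * math.sqrt(1 - h)) for e, h in zip(residuals, leverage)]

# Deleted residual: y_i minus the fit obtained without observation i.
deleted = [e / (1 - h) for e, h in zip(residuals, leverage)]

# Deleted t residual: deleted residual divided by its estimated standard
# deviation, where s_del is the residual standard deviation without point i.
external = []
for e, h, d in zip(residuals, leverage, deleted):
    s_del = math.sqrt((sse - e ** 2 / (1 - h)) / (n - p - 1))
    external.append(d / (s_del / math.sqrt(1 - h)))

for i in range(n):
    print(i, round(residuals[i], 3), round(internal[i], 3), round(external[i], 3))
```

The deleted residual is obtained here from the leverage shortcut e_i / (1 - h_i) rather than by refitting the model n times; both routes give the same number.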
Note: the residual plot below was drawn by hand. The plots that Excel generated automatically were unreadable, possibly because of a version problem.
[Figure: hand-drawn residual plot]
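2 Analysis with Minitab
The data can be copied directly from the Excel sheet into a Minitab worksheet. Then choose Regression from the Stat menu; the first item in the list is simple linear regression. Selecting it opens the dialog shown below:
[Screenshot: Minitab Regression dialog]
Options (N): the statistics to report.
Graphs (G): residual plots can be drawn from regular, standardized, or deleted residuals.
When the options are set, click OK. The output is shown below (to export it, right-click in the Session window and choose "Send Section to Microsoft Word"). The results agree with those from Excel. The three kinds of residual plots follow the numerical output.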
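Regression Analysis: y versus x
The regression equation is
y = 2.55 + 1.81 x
Predictor
Constant
x
S = 0.368925
Analysis of Variance
Source
Regression
Residual Error
Total
Unusual Observations
Obs
R denotes an observation with a large standardized residual.
Durbin-Watson statistic = 2.02262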
Below are the three kinds of residual plots; in practice, choosing one of them is enough.
[Figures: the three kinds of residual plots]
## Definition of leverage: http://blog.sina.com.cn/s/blog_b5c8908c0101ck2x.html
## durbinWatsonTest() comes from the car package, which in turn requires the MASS and nnet packages.
The output follows (focus on the values shown in red).
Variables Entered/Removed^a
| Model | Variables Entered | Variables Removed | Method |
|---|---|---|---|
| 1 | x^b | . | Enter |
a. Dependent Variable: y
b. All requested variables entered.
Model Summary^b
| Model | R | R Square | Adjusted R Square | Std. Error of the Estimate | Durbin-Watson |
|---|---|---|---|---|---|
| 1 | 1.000^a | 1.000 | 1.000 | .36869 | 1.984 |
a. Predictors: (Constant), x
b. Dependent Variable: y
ANOVA^a
| Model | | Sum of Squares | df | Mean Square | F | Sig. |
|---|---|---|---|---|---|---|
| 1 | Regression | 41548.431 | 1 | 41548.431 | 305662.569 | .000^b |
| | Residual | 6.525 | 48 | .136 | | |
| | Total | 41554.955 | 49 | | | |
a. Dependent Variable: y
b. Predictors: (Constant), x
Coefficients^a
| Model | | Unstandardized Coefficients: B | Unstandardized Coefficients: Std. Error | Standardized Coefficients: Beta | t | Sig. |
|---|---|---|---|---|---|---|
| 1 | (Constant) | 2.562 | .183 | | 14.025 | .000 |
| | x | 1.814 | .003 | 1.000 | 552.868 | .000 |
a. Dependent Variable: y
Residuals Statistics^a
| | Minimum | Maximum | Mean | Std. Deviation | N |
|---|---|---|---|---|---|
| Predicted Value | 49.3756 | 148.4472 | 99.3360 | 29.11919 | 50 |
| Residual | -.89243 | 1.01077 | .00000 | .36490 | 50 |
| Std. Predicted Value | -1.716 | 1.687 | .000 | 1.000 | 50 |
| Std. Residual | -2.421 | 2.742 | .000 | .990 | 50 |
a. Dependent Variable: y
The above is the SPSS output.