320.Cook's D

Tags: cookd, regression, reg, g3d, 3dsurface
Category: Statistics Sharing
Introduction
Cook's distance (庫克距離).
Cook's distance measures the effect of deleting a given observation. Data points with large residuals (outliers) and/or high leverage may distort the outcome and accuracy of a regression.
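In SAS, the denominator is p*MSE, where p is the number of parameter estimates and MSE is the mean squared error of the full model. With that denominator, Cook's distance is
$$D_i = \frac{(\hat{\beta} - \hat{\beta}_{(i)})^{T} (X^{T}X) (\hat{\beta} - \hat{\beta}_{(i)})}{p \cdot MSE}$$ (formula 1)
where $\hat{\beta}_{(i)}$ is the estimate obtained with the i-th observation deleted.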
In my program, N subset models were refitted, each with one observation deleted, and the β estimates were saved. Of course, SAS computes Cook's D much more efficiently: it uses the hat matrix H instead of refitting the model many times.
$$D_i = \frac{e_i^{2}}{p \cdot MSE} \cdot \frac{h_{ii}}{(1 - h_{ii})^{2}}$$ (formula 2)
where $e_i$ is the i-th residual and $h_{ii}$ is the i-th diagonal element of the hat matrix H.
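So the program here concentrates on formula 1 and is mostly for demonstration. In practice SAS computes something like formula 2, using the hat matrix, with no need to refit the model repeatedly. Cook's distance is not the same thing as an outlier value: the statistic gauges how strongly a given observation influences the model. For details see: http://en.wikipedia.org/wiki/Cook's_distance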
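The link between the two formulas can be written out in a few lines. This is a sketch that assumes the standard leave-one-out identity for least squares; the symbols $x_i$ (the i-th row of X as a column vector), $e_i$ and $h_{ii}$ are notation introduced here for illustration.

```latex
% Leave-one-out update of the least-squares estimate
\hat{\beta} - \hat{\beta}_{(i)} = \frac{(X^{T}X)^{-1} x_i \, e_i}{1 - h_{ii}},
\qquad h_{ii} = x_i^{T} (X^{T}X)^{-1} x_i

% Plug into the numerator of formula 1
(\hat{\beta} - \hat{\beta}_{(i)})^{T} (X^{T}X) (\hat{\beta} - \hat{\beta}_{(i)})
  = \frac{e_i^{2} \, x_i^{T} (X^{T}X)^{-1} x_i}{(1 - h_{ii})^{2}}
  = \frac{e_i^{2} \, h_{ii}}{(1 - h_{ii})^{2}}

% Divide by p * MSE to obtain the hat-matrix form
D_i = \frac{e_i^{2}}{p \cdot MSE} \cdot \frac{h_{ii}}{(1 - h_{ii})^{2}}
```

Only the residuals and the hat diagonals of the full model are needed, which is why no refitting is required.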
Program & Result
PROC REG reports Cook's D among its influence statistics and also draws a Cook's D plot in its ODS statistical graphics. With an ID statement (ID name), the outliers in the plot are labeled by the name variable. The SSCP matrix is XᵀX. The MSE option in the MODEL statement writes the model's MSE value to the OUTEST= table. The model contains two predictors, Age and Height, so p = 2 + 1 (intercept) = 3.
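proc reg data=sashelp.class outest=est outsscp=sscp;
   id name;                              /* label points by the name variable */
   model weight = age height / r mse;    /* R prints Cook's D; MSE adds the MSE to OUTEST= */
run;
quit;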
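As a cross-check on the hat-matrix form, the diagnostics that PROC REG already saves can be recombined by hand. This is a minimal sketch, not the author's program: the data set names est0, diag and check and the variables cookd_sas, lev, resid and cookd_hat are made up for illustration.

```sas
/* Fit the full model and keep per-observation diagnostics. */
proc reg data=sashelp.class outest=est0 noprint;
   model weight = age height;
   output out=diag cookd=cookd_sas h=lev r=resid;
run;
quit;

data check;
   /* _RMSE_ (root MSE of the full model) is read once and retained */
   if _n_ = 1 then set est0(keep=_RMSE_);
   set diag;
   p = 3;                                   /* intercept + Age + Height */
   cookd_hat = (resid**2 / (p * _RMSE_**2)) * lev / (1 - lev)**2;
   diff = cookd_hat - cookd_sas;            /* should be close to zero */
run;

proc print data=check;
   var name cookd_sas cookd_hat diff;
run;
```

If the two columns agree, the hat-matrix formula reproduces the PROC REG values without any refitting.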
The macro %fitmdls refits the model on the N-1 observations left after deleting the i-th observation from the data set. The OUTEST= table holds the parameter estimates.
%macro fitmdls( ind);
%mend fitmdls;
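One way such a refit step could look is sketched below. It is a hypothetical reconstruction, not the original macro body: the names fitmdls_sketch, runall, _sub, _est and allest are assumptions, and a macro %DO loop drives the refits instead of a DATA step.

```sas
/* Refit the model without observation &ind and append the estimates. */
%macro fitmdls_sketch(ind);
   data _sub;
      set sashelp.class;
      if _n_ ne &ind;              /* drop the ind-th observation */
   run;

   proc reg data=_sub outest=_est noprint;
      model weight = age height;
   run;
   quit;

   data _est;
      set _est;
      obs = &ind;                  /* record which observation was deleted */
   run;

   /* PROC APPEND creates allest on the first call, then adds rows */
   proc append base=allest data=_est force;
   run;
%mend fitmdls_sketch;

%macro runall(n);
   %local i;
   %do i = 1 %to &n;
      %fitmdls_sketch(&i)
   %end;
%mend runall;

%runall(19)   /* sashelp.class has 19 observations */
```

Delete allest before rerunning, otherwise the new estimates are appended to the old ones.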
A temporary table t is created in which each observation number is marked for refitting the model.
data t;
The model can be called for each observation from a DATA step. The first step is to create an empty data set; the parameter estimates of each sub-model are then appended to it one by one.
data _null_;
run;
The following two macros do matrix calculations in a DATA step: the first multiplies two matrices and the second transposes a matrix. Note that matrix calculation is inefficient in a DATA step because it involves many loops, and the DATA step can only handle very basic matrix operations anyway.
%macro mult(mat1, mat2, mat0);
%mend mult;
%macro transpose(mat1, mat0);
%mend transpose;
data cookd; keep cookd;
From formula 1, Cook's D needs the parameter estimates (β), the SSCP matrix (XᵀX), the number of estimates (p), and the MSE of the original full model fitted to the complete data.
The following code computes Cook's D. Again, this is feasible in a DATA step only because the sample size is small.
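Formula 1 can also be evaluated with a matrix language, which avoids the hand-written multiplication and transpose macros. This is a minimal sketch that assumes SAS/IML is licensed. It swaps PROC IML in for the DATA step approach described here, and the names X0, rows, cookd and cookd_f1 are illustrative.

```sas
proc iml;
   use sashelp.class;
   read all var {age height} into X0;
   read all var {weight} into y;
   close sashelp.class;

   n   = nrow(X0);
   X   = j(n, 1, 1) || X0;             /* add the intercept column */
   p   = ncol(X);                      /* 3 parameter estimates */
   xpx = X` * X;                       /* SSCP matrix of the predictors */
   b   = solve(xpx, X` * y);           /* full-model estimates */
   mse = ssq(y - X * b) / (n - p);     /* full-model MSE */

   cookd = j(n, 1, .);
   do i = 1 to n;
      rows = setdif(1:n, i);           /* all observations except the i-th */
      Xi = X[rows, ];
      yi = y[rows];
      bi = solve(Xi` * Xi, Xi` * yi);  /* leave-one-out estimates */
      d  = b - bi;
      cookd[i] = (d` * xpx * d) / (p * mse);
   end;

   create cookd_f1 var {cookd};        /* save the results as a data set */
   append;
   close cookd_f1;
quit;
```

The loop refits the model N times, just as the macro approach does, so it demonstrates the formula and offers no speed advantage.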
Figure 2:
Compute cut-off for Cook’s D.
People use different cut-offs for Cook's D; some use 1 or 1/(N+p), for example. SAS uses cut-off = 4/N, which depends on the sample size.
data CookDout2;
run;
The Cook's D plot in SAS is a needle plot, with X = observation number and Y = Cook's D. As stated above, if an ID variable is given, the outliers are labeled.
proc sgplot data =CookDout2;
run;
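The cut-off and the needle plot can be combined in one short program. This is a sketch under assumptions: the data set names cookd_out and cookd_plot, the variable outlier_id and the macro variable cutoff are not taken from the original code.

```sas
/* Cook's D for the full model */
proc reg data=sashelp.class noprint;
   model weight = age height;
   output out=cookd_out cookd=cookd;
run;
quit;

data cookd_plot;
   set cookd_out nobs=n;
   obs = _n_;                           /* observation number for the X axis */
   cutoff = 4 / n;                      /* 4/19, about 0.21, for sashelp.class */
   length outlier_id $8;
   if cookd > cutoff then outlier_id = name;   /* label only flagged points */
   call symputx('cutoff', cutoff);
run;

proc sgplot data=cookd_plot;
   needle x=obs y=cookd / datalabel=outlier_id;
   refline &cutoff / axis=y lineattrs=(pattern=dash);
   xaxis label="Observation";
   yaxis label="Cook's D";
run;
```

Points whose needle rises above the dashed reference line exceed the 4/N cut-off and carry a name label.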
3D surface plot for data.
The 3D plot contains a 2-dimensional predictor space (X-Y) and a 1-dimensional response space (Z). PROC TEMPLATE creates the plot template, PROC G3GRID derives the gridded data points that a 3D surface plot requires, and PROC SGRENDER renders the template into a graph.
proc template;
run;
proc g3grid data=sashelp.class out=spline;
   grid age*height=weight / spline;   /* assumed: predictors on the grid, weight as response */
run;
proc sgrender data =spline template =surfaceplotparm; run;
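A template that PROC SGRENDER could render might look like the sketch below. Only the template name surfaceplotparm comes from the text; the axis assignment, the title and the SURFACETYPE= setting are assumptions, and the gridded data set is expected to contain the variables age, height and weight.

```sas
proc template;
   define statgraph surfaceplotparm;
      begingraph;
         entrytitle "Weight by Age and Height";
         layout overlay3d / cube=false;
            /* predictors on X and Y, response on Z */
            surfaceplotparm x=age y=height z=weight / surfacetype=fill;
         endlayout;
      endgraph;
   end;
run;
```

SURFACEPLOTPARM expects data that already lie on a regular grid, which is why the PROC G3GRID step comes before rendering.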
Figure 4: In the 3D surface plot of the model data, a bump appears at (age=12, height=65, weight=128); this is Robert. The point makes the surface change abruptly. You may want to remove it from the model, and in fact many people do. However, simply deleting influential points from the model can fail in many cases.
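The point flagged by Cook's distance is the abrupt point on the prediction surface. What to do about it varies from person to person; there seems to be no established answer or single most reasonable solution.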
A 2-dimensional display of the data: a series plot for each Age class.
proc sort data=sashelp.class out=dataplot; by age height weight; run;
proc sgplot data=dataplot;
   series x=height y=weight / group=age markers;   /* assumed axes: one series per Age class */
run;
Figure 1: Robert, at 12, has a height in the same range as the 14-year-olds. That is rather unusual for his age. It is not surprising that Cook's D identifies him as an outlier.