加载中…
个人资料
  • 博客等级:
  • 博客积分:
  • 博客访问:
  • 关注人气:
  • 获赠金笔:0支
  • 赠出金笔:0支
  • 荣誉徽章:
正文 字体大小:

320.Cook's D

(2014-10-30 05:11:01)
标签:

cookd

regression

reg

g3d

3dsurface

分类: 统计分享

Introduction

庫克距離(Cook’s distance)。

Cook's distance measures the effect of deleting a given observation. Data points with large residuals (outliers) and/or high leverage may distort the outcome and accuracy of a regression. 

http://upload.wikimedia.org/math/9/7/a/97af86602cb21747b328402e8f90a022.pngD" />(formula 1)

In SAS, the denominator above is replaced by p*MSE instead. In the formula, Cook's distance is represented in terms of changes to the estimates of the regression parameters between the cases where the particular observation is either included or excluded from the regression analysis.

 

In my program a number (= N) of subset models on deleting one observation every time were refitted and the β’s were given. Of course, SAS computes Cook’s D values in a much efficient way: using hat matrix H instead of refitting the model many times.

http://upload.wikimedia.org/math/b/c/b/bcb0a29582880fa53973a9d51c090286.pngD" />(formula 2)

所以,在这里的程序专注于公式一,更多只是用于展示的作用。事实上,SAS计算类似于公式二,利用HAT距阵,无须重复建立模型.庫克距離和偏离值不一样,该统计量用来检验某观测对模型的影响程度。详细见:http://en.wikipedia.org/wiki/Cook's_distance

Program & Result

Proc REG contains Cook's D in influence statistics. It provides Cook’s D plot as well in its statistical ODS Graph. ID name can label the outliers in the plot by name variable. SSCP matrix is XTX.MSE option in MODEL statement writes model mse value to OUTEST= table. In the model contains two predictors of Age and Height thus p = 2+1 (intercept) = 3. 

proc reg data=sashelp.class outest=est outsscp=sscp  plots(only label)=(RStudentByLeverage CooksD);

  id name;

  model weight =age height/mse influence ;

  output out=CookDout cookd =cookd;

run; quit

The macro %fitmdls is to refit the subset models on N-i observations, i.e, deleting ith observation from the data set. The OUTEST= table contains parameter estimates.

%macro fitmdls( ind);

  proc reg data=t outest=esti(keep =intercept age height) noprint;

    where ind ^=&ind;

    model weight =age height/mse;

    run; quit;

  proc append base =est0 data =esti; run;

%mend fitmdls;

A temporary table t was created. In the table, the observation number is marked for refitting the model.

data t;  set sashelp.class;  ind ++1; run;

Models can be called on each observation in data step. The first step is to generate an empty data set. Step by step, the data set will be appended all the sub-models’ parameter estimates.

data _null_;

  set t(keep=ind);

  if _n_ =1 then call execute('data est0;length intercept age height 8.; stop; run;');

  call execute('%fitmdls('||cats(ind)||')');

run;

The following two macros are used to do matrix calculation in data step. The first one is for matrix multiplication and second one matrix transpose. Of note: matrix calculation is inefficient in data step since that will be involved with a lot of loops, not least that data step is only capable of doing very basic matrix calculations.

%macro mult(mat1, mat2, mat0);

  call missing(of &mat0[*]);

  do i =1 to dim1(&mat1);

    do j =1 to dim2(&mat2);

      do k =1 to dim2(&mat1);

        &mat0[i,j] ++(&mat1[i,k]*&mat2[k,j]);

      end;

    end;

  end;

%mend mult;

%macro transpose(mat1, mat0);

  do i =1 to dim1(&mat1);

    do j =1 to dim2(&mat1);

      &mat0[j,i] =&mat1[i,j];

    end;

 end;

%mend transpose;

data cookd; keep cookd;

From formula 1, Cook’s D needs parameter estimates(β), SSCP matrix(XTX), number of estimates(p), and MSE in the original full model with complete data.

  if _n_ =1 then set est(keep =intercept age height _p_ _mse_ rename =(intercept =intercept0 age  =age0 height =height0));

  array sscpm[3,3] _temporary_;

  if _n_ =1 then do p =1 to nobsx;

    set sscp point=p nobs=nobsx;

    if _type_ ='SSCP' then if upcase(_name_) in ('INTERCEPT', 'AGE', 'HEIGHT') then do;

        sscpm[p,1] =intercept; sscpm[p,2] =age; sscpm[p,3] =height;

    end;

  end;

  set est0 end=Eof;

The following code is to compute the Cook’s D. Again it is doable in data step only because the sample size is small.

  array dbeta[3,1] _temporary_; array dbetat[1,3] _temporary_;  array out2m[1,3] _temporary_; array result[1,1] _temporary_;

  dbeta[1,1] =-intercept+intercept0;  dbeta[2,1] =-age      +age0;    dbeta[3,1] =-height   +height0;

  %transpose(dbeta, dbetat);  %mult(dbetat, sscpm , out2m );  %mult(out2m , dbeta , result);

  cookD =result[1,1]/(_p_ *_mse_);  label cookd ="Cook's D";

  run;

http://s6/mw690/002ZOYCigy6NcVi1XU1f5&690D" TITLE="320.Cook's D" />

Figure 2: Calculated Cook's Distance in the ordinary linear regression model. Each observation has its Cook's D statistic. A larger Cook's D, the greater the influence of the record. The largest value of Cook's D is on observation 17, which is much exceeding the cutoff we define as 4/N = 4/19 = 0.2 here.

Compute cut-off for Cook’s D.

People may use different cut-off for Cook’s D. For example, some of them use 1 or 1/(N+p), etc. According to SAS, cut-off = 4 /N, based on the sample size.

data CookDout2;

  call symputx('cookd_ref', 4/nobsx);  if 0 then set t(drop =_all_) nobs =nobsx;

  merge sashelp.class cookd end =Eof;

  Observation =_n_;  if cookd >4/nobsx then id_ref =name;

run;

Cook’s D plot in SAS is based on needle plot. X = Observation and Y = Cook’s D. As stated above, if ID variable was given, the outliers are labeled.

proc sgplot data =CookDout2;

  needle x =Observation y =cookd/baseline =0 datalabel =id_ref markers markerattrs =(symbol=circle);  refline &cookd_ref/axis =y;

run;

http://s13/mw690/002ZOYCigy6NcVido9Ccc&690D" TITLE="320.Cook's D" />

 Figure 3: Robert has the greatest effect on the model. From the data table, Robert was 12 years old (young)but 128 pounds in weight (heavy) and 65 inch high. It sounds like sort of abnormal.

3D surface plot for data.

3D plot contain 2-dimension predictor space (X-Y) and 1-dimension response space (Z). PROC TEMPLATE creates a plot template; PROC G3GRID provides a derived data points required in 3D surface plot; and PROC SGRENDER renders the template into graph.

proc template;

  define statgraph surfaceplotparm;

    begingraph;

      entrytitle "SURFACECOLORGRADIENT=WEIGHT";

      layout overlay3d / cube=false;

      surfaceplotparm x=age y=height z=weight/surfacetype=fill surfacecolorgradient=weight colormodel=twocolorramp reversecolormodel=true ;

      endlayout;

    endgraph;G

  end;

run;

proc g3grid data =sashelp.class out =spline;  grid age*height = weight / naxis1=75 naxis2=75 spline; run;

proc sgrender data =spline template =surfaceplotparm; run;

http://s1/mw690/002ZOYCigy6NcVisYWQd0&690D" TITLE="320.Cook's D" /> 

Figure 4: From 3D surface plot about the data in the model, a bumpy point with (age=12, height=65, weight=128, it is Robert) was observed from the surface. This point causes the surface changes abruptly. Probably you want to remove this point from the model. In fact, many people will do that. However, simply deleting influential points from the model may fail in many cases.

庫克距離指出的点即为预测平面上的突兀点.针对该点所采取的措施因人而不同.似乎没有一个既定的答案或者最为合理的解决方案.

2-dimension display data by series plot on each Age class.

proc sort data =sashelp.class out =dataplot; by age height weight;

proc sgplot data =dataplot;

  series  x =height y =weight/group =age datalabel =name markers;  xaxis grid; yaxis grid;

run;

 

http://s11/mw690/002ZOYCigy6NcVhRB7Y4a&690D" TITLE="320.Cook's D" />

Figure 1: Robert in 12 has the height in the same range of 14-year-old kids. That is kind of abnormal in term of his age and height. It is not surprised that Cook’s D identifies him as an outlier.








0

阅读 收藏 喜欢 打印举报/Report
  

新浪BLOG意见反馈留言板 欢迎批评指正

新浪简介 | About Sina | 广告服务 | 联系我们 | 招聘信息 | 网站律师 | SINA English | 产品答疑

新浪公司 版权所有