Understanding the Differences and Connections Between the BY and CLASS Statements in SAS

(2014-04-07 14:55:38)
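Category: Easily confused concepts
In a procedure, the BY statement names one or more grouping variables. SAS splits the observations into groups according to the values of those variables and then runs the analysis requested by the procedure separately on each group. Before a PROC step that uses a BY statement, the data set is normally sorted with PROC SORT.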

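In some procedures (analysis of variance, for example), the CLASS statement names one or more classification variables, which act as the factors, that is, the independent variables, of the model. In other procedures (such as MEANS), CLASS works much like BY: it names classification variables, and the observations are grouped by their values and analyzed group by group. With CLASS, the data do not need to be sorted by the classification variables first.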

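If you do not ask for the results to be written to an output data set, the two give almost the same results.
If you do write the results to a data set, however, the output differs somewhat. To see this, run the two programs below and then open data sets b and c under Libraries in the Explorer window on the left side of the SAS session:
proc sort data=a; by xibao; run;   /* a is the original data set */
proc means data=a; by xibao; var y; output out=b n=n0 mean=m1 max=ma min=mi; run;

proc means data=a; class xibao; var y; output out=c n=n0 mean=m1 max=ma min=mi; run;
Data set c contains one extra observation, the overall total.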

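In SAS procedures, BY and CLASS are two entirely different concepts, even if their results happen to coincide in some places. BY, in essence, partitions the data itself (SUBSET DATA). CLASS assigns observations to categories according to the values of a variable (CLASSIFY VARIABLE). An example:

proc means data=sashelp.class; by sex notsorted; var age; run;
proc means data=sashelp.class; class sex; var age; run;
The two procedures produce entirely different results. Because sashelp.class is not sorted by sex, BY with NOTSORTED starts a new group every time the value of sex changes from one observation to the next, so the first procedure reports several small groups, while CLASS pools all observations with the same sex into two groups.
On efficiency: in some procedures, sorting by the CLASS variables beforehand may improve the procedure's performance. If I had to name a close relative of BY, I would say the STRATA statement is somewhat similar, in procedures where it exists.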

We were having an (admittedly academic) discussion on the differences between using CLASS versus BY in PROC MEANS. Performance issues aside, are there any differences? Some of my colleagues vaguely recalled something about missing values being treated differently, but we couldn't reproduce this. Are there differences (again, performance aside), or did we remember incorrectly? Thanks.



  • I don't think there are any numerical/statistical differences. I find the CLASS statement convenient when I want to see all of the output in a single table; the BY group approach puts each BY group's statistics on a separate page. Also, you need to SORT the data to use BY groups, but not to use the CLASS statement. The BY group approach is more efficient when the data are sorted, and requires less memory. The output data sets also look different for the two approaches.


    • Rick, you have it exactly right. I just wanted to expound upon one of your points.

      Comparing CLASS STATE COUNTY; vs. BY STATE COUNTY;

      In the output data set using BY, there is one observation for each STATE/COUNTY combination.

      In the output data set using CLASS, you get those same observations, plus: one observation holding a summary for the entire data set, one set of observations holding a summary for each STATE, and another set of observations holding a summary for each COUNTY. The variable _TYPE_ in the output data sets tells you what the level of summarization is for that observation.
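The levels of summarization described above can be inspected with a small sketch like the one below. It is only an illustration: it uses sashelp.class, with sex and age standing in for STATE and COUNTY, and the output data set name work.by_class and the variable name mean_height are made up for this example.

```sas
/* Sketch only: sex and age stand in for STATE and COUNTY.           */
/* work.by_class and mean_height are hypothetical names.             */
proc means data=sashelp.class noprint;
  class sex age;                       /* two classification variables */
  var height;
  output out=work.by_class n=n mean=mean_height;
run;

/* _TYPE_ = 0: whole data set                                        */
/* _TYPE_ = 1: one row per age (the last CLASS variable)             */
/* _TYPE_ = 2: one row per sex (the first CLASS variable)            */
/* _TYPE_ = 3: one row per sex/age combination                       */
proc print data=work.by_class;
  var _type_ sex age _freq_ n mean_height;
run;
```

If only the most detailed rows are wanted, adding the NWAY option to the PROC MEANS statement keeps just the observations with the highest _TYPE_ value, which correspond to the combinations that the BY version writes.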

       

      The printed reports give you summaries at the most detailed level only, even if the output data sets would be different. And, as Rick noted, the format of the reports would change.

      Finally, your colleagues' recollection is correct. Any observation where a CLASS variable is missing will be thrown out of the analysis. The MISSING option changes that, treating missing values like any other value for a CLASS variable.
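A minimal sketch of the missing-value behavior just described, under stated assumptions: the data set work.demo is hypothetical, built from sashelp.class by blanking out sex for two students so that there is something to drop.

```sas
/* Hypothetical test data: blank out the CLASS variable for two rows. */
data work.demo;
  set sashelp.class;
  if name in ('Alfred', 'Alice') then sex = ' ';
run;

/* Default: observations with a missing CLASS value are excluded.     */
proc means data=work.demo n mean;
  class sex;
  var age;
run;

/* MISSING: the blank value is treated as a class level of its own.   */
proc means data=work.demo n mean missing;
  class sex;
  var age;
run;
```

Comparing the two reports should show the two altered observations absent from the first and collected under a blank sex level in the second. The same option can also be given on the CLASS statement itself (class sex / missing;) when it should apply only to specific variables.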


      • As a general rule where you are dealing with large datasets (> 1GB) and there are many distinct values of the class variables, I have often found SAS will process faster using BY rather than CLASS even with the SORT time added in as well. If your data is already sorted in the right order then the benefit is even greater.
