加载中…
个人资料
  • 博客等级:
  • 博客积分:
  • 博客访问:
  • 关注人气:
  • 获赠金笔:0支
  • 赠出金笔:0支
  • 荣誉徽章:
正文 字体大小:

SAS 数据汇总全攻略(过程步+SQL+HASH)

(2014-04-07 12:16:36)
分类: 过程步程序

 总的来讲SAS做汇总的方法有4类,分别为sort+data步,proc summary,SQL和Hash。下面做详细解析。

http://www.stattutorials.com/SAS/index.html

http://blog.sina.com.cn/s/blog_61c463090100ljtq.html

http://www2.sas.com/proceedings/sugi30/263-30.pdf

http://www2.gliet.edu.cn/mathmodel/Upload/limitFiles/yjzy/SASczrm.pdf

http://www.ats.ucla.edu/stat/sas/library/nesug00/p068.pdf

Question:

有一个数据(SAS造出来的),含有分类字段Class1,需要汇总的变量X和Y。数据产生过程如下:

data Sample;
   do i=1 to 1000;
     do class2=1 to 10;
    class1=class2+i;
       x=ranuni(1)*10;
    y=(1-ranuni(1))/max(0.00001,ranuni(1));

    drop i;
   output;
  end;
   end;
run;

很明显,数据sample 并没有按照变量class1排序。

下面探讨四种方法的具体做法:     

1.Sort + data步 
proc sort data=sample;
    by class1 class2;
run;
data out;
    retain SumX SumY 0;
    set sample;
  by class1 class2;
  if first.class1 then do; sumX=0; SumY=0; end;
   sumx+x;  sumY+y;
   drop class2  x y;
  if last.class1 then output;
run;     
该方法较为普通,且较为浪费时间,但是可以个性化定制,如不仅可以求简单加总,还可以做一些其他运算。

2. Proc summary

proc summary data=sample nway;
     class  class1;
  var x y;
  output out=out(drop=_:) sum= sumx sumy;
run;
该方法速度较快,但是较为死板,无法做运算,如求两个变量的乘积之和,变量的平方和等,需要之前加一个data步,但是该过程不的好处就是不需要排序。


sas
proc summary data= econ.c97c02manu;
by county statecode cropyear;
var netacre cropyear cropcode;
output out=econ.c97c02manutotal sum(netacre)=totacre sum(liability)=totliab sum(subsidy)=totsub sum(totprem)=totprem sum(indemnity)=totindem ;
run;

STATA
 egen sumacre=sum(netacre),by  (county statecode cropyear)


这段code的目的是生成面积总和。原始数据中包含每个县每年很多类别的播种面积。这段code就是为了计算每个县每年总的播种面积:

每个county,每年都有netacre这个变量的观测量。而且不同的州有些县的名称是一样的,所以同样要考虑到州(state)这个变量。根据县(county),州(statecode)和年份(cropyear)这三个变量生成每个县每年的总的播种面积。

在SAS 和STATA中,都用到了by, by + 要用来分类的变量。

SAS这段的结果只会保留加总的变量,原变量不会保存。
STATA结果会保存原始变量,但原始数据的每个observation后面会加一个加总的值。


data a;
input x y;
cards;
1 2
1 2
;
proc means data=a;
output out=b mean(x)=mx sum(x)=sx mean(y)=my sum(y)=sy;
proc print data=b;
run;


3. SQl

proc sql;
 create table out as
 select class1
        ,sum(x) as sumX
  ,sum(y) as sumY
 from sample
 group by class1
 order by class1;
quit;
对于小数据来说,该方法非常便捷,而且不需要排序,也可以计算数据。但是其缺点是对于大数据的处理相当浪费时间和缓存。

4. Hash Object
data _null_;
   length sumx sumY class1 8 ;
   if _n_=1 then do;
    dcl hash hh(hashexp:12,ordered:'a');
      hh.definekey('class1');
      hh.definedata('class1','sumx','sumY');
      hh.definedone();
   end;
   call missing(class1,sumx,sumy);
   do while(last^=1);
      set sample end=last;
   rc=hh.find();
   if rc^=0 then sum=0;
   sumx+x;
   sumy+y;
   hh.replace();
   end;
   hh.output(dataset:'output');
run;
该方法是目前为止,作者见到的最优效率的方法,不仅可以计算也不需要排序,效果甚至比proc summay还要快,但是其缺点就是比较耗内存,但是对于上千万的记录1G内存也就足够了。

四种方法的比较

 

  对大数据的处理速度 占用内存情况 占用缓存情况 I/O使用 需要排序
Data step 最慢 较小 较小 最大
summary 较快 较小 较小 一般
SQl 较慢 较小 一般
Hash 最快 较小 一般




Sql是基于关系数据库理论,means是基于SAS 统计理论,rdbms和sas术语之间的关系如下,

Sas data set 对应 rdbms table

Sas observations 对应 rdbms rows

Sas variables 对应 rdbms columns

1、输入数据源

在sas means过程步中,设置数据源data = source,在sql 过程步中,用from table name

输出设置

两个过程步都可以生成输出窗口、数据集和表。但是,在menas过程步中,输出窗口中的数据显示的格式和sas数据集中产生的数据不同,而且在sql中是一样的。

2、默认统计信息

默认情况下,means过程步产生数据集中的所有数值变量的计数,均值,标准偏差,最大值和最小值。Means过程步不能在字符型变量上执行统计信息。如果数据集中没有数值变量,means过程步仅仅生成统计计数n。

默认情况下,sql过程步不产生任何统计信息。但是,sql过程步能够在字符型变量上计算统计信息。例如:

Proc SQL;

Select max(sex),Min(sex),n(sex),nmiss(sex)

From sashelp.class;

3、编码比较

(1)计算简单统计信息

Means使用:

PROC MEANS DATA=SASHELP.CLASS NONOBS MAXDEC=2 SUM MEAN STD ;

VAR AGE HEIGHT;

RUN;

Sql使用:

PROC SQL;

select 'Age' as Variable,

sum(age) as Sum format 10.2,

avg(age) as Mean format 10.2,

std(age) as std format 10.2 label 'Std Dev'

from sASHELP.CLASS

union

select 'Height' as Variable,

sum(height) as Sum format 10.2,

avg(height) as Mean format 10.2,

std(height) as std format 10.2 label 'Std Dev'

from sASHELP.CLASS;

QUIT;

4、运用group by和class进行分组处理

(1)means

PROC MEANS DATA=SASHELP.CLASS NONOBS MAXDEC=2 SUM MEAN STD ;

CLASS sex;

VAR AGE HEIGHT;

RUN;

(2)sql

PROC SQL;

select sex as Sex,

'Age' as Variable,

sum(age) as Sum format 10.2,

avg(age) as Mean format 10.2,

std(age) as std format 10.2 label 'Std Dev'

from SASHELP.CLASS

Group by sex

select sex as Sex,

'Height' as Variable,

sum(height) as Sum format 10.2,

avg(height) as Mean format 10.2,

std(height) as std format 10.2 label 'Std Dev'

from sASHELP.CLASS

group by sex;

quit;

5、在means过程步中,运用class结合type语句分组处理

默认情况下,分组是发生在class变量之后。通过使用type语句可以选择所有或者单独的class变量。

data grade;

input Name $ Gender $ Status $ Year $ Section $ Score FinalGrade @@;

datalines;

Abbott F 2 97 A 90 87 Branford M 1 97 A 92 97

David M 3 99 c 87 96 Crandell M 2 98 B 81 71

Dennison M 1 97 A 85 72 Edgar F 1 98 B 89 80

Nancy F 3 99 B 79 88 Faust M 1 97 B 78 73

Greeley F 2 97 A 82 91 Hart F 1 98 B 84 80

Mick M 3 98 c 77 91 Isley M 2 99 A 88 86

Jasper M 1 97 B 91 93 Ray M 3 97 B 76 90

Mary F 3 97 C 75 79 Nick M 2 98 B 85 89

Billy M 2 98 C 77 83 Taylor F 3 98 A 86 81

Roy M 3 98 B 92 84 Mandy F 3 97 C 88 87

;

run;

proc means data=grade nonobs n mean sum maxdec=2;

class Status Year;

var Score;

types () status*year;

run;

sql处理

proc sql;

select count(*) as N,

avg(score) as Mean format 10.2 ,

std(score) as Sum format 10.2

from grade;

quit;

proc sql;

select status,year,

count(*) as N,

avg(score) as Mean format 10.2 ,

sum(score) as Sum format 10.2

from grade

group by status, year;

quit;

6、剔除不在classdata中的分析变量

data statyear;

input Status $ Year $ @@;

datalines;

1 97 1 98 2 97 2 98 3 97 3 99

;

run;

proc means data=grade nonobs nway n mean sum maxdec=0

classdata=statyear exclusive;

class status year;

var score;

proc sql;

select status,year,

count(*) as N,

mean(score) as Mean format 10.2 ,

sum(score) as Sum format 10.2

from grade

where status||year in (select status||year from statyear)

group by status, year;

quit;

7、从proc means和proc sql中创建输出数据集

在proc means中通过output产生输出数据集

Proc means data=grade nonobs maxdec=2 noprint;

Class status year;

Var score finalgrade;

Output out=sumgrade max=scr_max grade_max mean(score finalgrade) = scr_mean

grade_mean sum=;

format scr_mean grade_mean 10.1;

Run;

proc print data=sumgrade;run;

这样可以产生两个自动变量_type_,_FREQ_。

使用sql处理

PROC SQL;

Create table sumgrade as

Select ' ' as Status, ' ' as Year ,0 as _TYPE_ , count(*) as _FREQ_,

Max(score) as scr_max, max(finalgrade) as grade_mean ,

avg(score) as scr_max format 10.1,

avg(finalgrade) as grade_mean format 10.1,

sum(score) as score, sum(finalgrade) as finalgrade

from grade

union

Select ' ' as Status, Year ,1 as _TYPE_ , count(*) as _FREQ_,

Max(score) as scr_max, max(finalgrade) as grade_mean ,

avg(score) as scr_max format 10.1 ,

avg(finalgrade) as grade_mean format 10.1,

sum(score) as score, sum(finalgrade) as finalgrade

from grade

group by year

union

Select Status, ' ' as Year ,2 as _TYPE_ , count(*) as _FREQ_,

Max(score) as scr_max, max(finalgrade) as grade_mean ,

avg(score) as scr_max format 10.1 ,

avg(finalgrade) as grade_mean format 10.1,

sum(score) as score, sum(finalgrade) as finalgrade

from grade

group by status

union

Select Status, Year ,3 as _TYPE_ , count(*) as _FREQ_,

Max(score) as scr_max, max(finalgrade) as grade_mean ,

avg(score) as scr_max format 10.1 ,

avg(finalgrade) as grade_mean format 10.1,

sum(score) as score, sum(finalgrade) as finalgrade

from grade

group by status,year

order by _TYPE_ ;

quit;

proc print data=sumgrade; run;

8、sql和means比较总结

0

阅读 收藏 喜欢 打印举报/Report
  

新浪BLOG意见反馈留言板 欢迎批评指正

新浪简介 | About Sina | 广告服务 | 联系我们 | 招聘信息 | 网站律师 | SINA English | 产品答疑

新浪公司 版权所有