SAS 数据汇总全攻略(过程步+SQL+HASH)
(2014-04-07 12:16:36)| 分类: 过程步程序 |
http://www.stattutorials.com/SAS/index.html
http://blog.sina.com.cn/s/blog_61c463090100ljtq.html
http://www2.sas.com/proceedings/sugi30/263-30.pdf
http://www2.gliet.edu.cn/mathmodel/Upload/limitFiles/yjzy/SASczrm.pdf
http://www.ats.ucla.edu/stat/sas/library/nesug00/p068.pdf
Question:
有一个数据(SAS造出来的),含有分类字段Class1,需要汇总的变量X和Y。数据产生过程如下:
data Sample;
run;
很明显,数据sample 并没有按照变量class1排序。
下面探讨四种方法的具体做法:
1.Sort +
data步
run;
data out;
run;
该方法较为普通,且较为浪费时间,但是可以个性化定制,如不仅可以求简单加总,还可以做一些其他运算。
2. Proc summary
proc summary data=sample nway;
run;
该方法速度较快,但是较为死板,无法做运算,如求两个变量的乘积之和,变量的平方和等,需要之前加一个data步,但是该过程不的好处就是不需要排序。
data a;
input x y;
cards;
1 2
1 2
;
proc means data=a;
output out=b mean(x)=mx sum(x)=sx mean(y)=my
sum(y)=sy;
proc print data=b;
run;
3. SQl
proc sql;
quit;
对于小数据来说,该方法非常便捷,而且不需要排序,也可以计算数据。但是其缺点是对于大数据的处理相当浪费时间和缓存。
4. Hash
Object
data _null_;
run;
该方法是目前为止,作者见到的最优效率的方法,不仅可以计算也不需要排序,效果甚至比proc
summay还要快,但是其缺点就是比较耗内存,但是对于上千万的记录1G内存也就足够了。
四种方法的比较
| 对大数据的处理速度 | 占用内存情况 | 占用缓存情况 | I/O使用 | 需要排序 | |
| Data step | 最慢 | 较小 | 较小 | 最大 | 是 |
| summary | 较快 | 较小 | 较小 | 一般 | 否 |
| SQl | 较慢 | 较小 | 大 | 一般 | 否 |
| Hash | 最快 | 大 | 较小 | 一般 | 否 |
Sql是基于关系数据库理论,means是基于SAS 统计理论,rdbms和sas术语之间的关系如下,
Sas data set 对应 rdbms table
Sas observations 对应 rdbms rows
Sas variables 对应 rdbms columns
1、输入数据源
在sas means过程步中,设置数据源data = source,在sql 过程步中,用from table name
输出设置
两个过程步都可以生成输出窗口、数据集和表。但是,在menas过程步中,输出窗口中的数据显示的格式和sas数据集中产生的数据不同,而且在sql中是一样的。
2、默认统计信息
默认情况下,means过程步产生数据集中的所有数值变量的计数,均值,标准偏差,最大值和最小值。Means过程步不能在字符型变量上执行统计信息。如果数据集中没有数值变量,means过程步仅仅生成统计计数n。
默认情况下,sql过程步不产生任何统计信息。但是,sql过程步能够在字符型变量上计算统计信息。例如:
Proc SQL;
Select max(sex),Min(sex),n(sex),nmiss(sex)
From sashelp.class;
3、编码比较
(1)计算简单统计信息
Means使用:
PROC MEANS DATA=SASHELP.CLASS NONOBS MAXDEC=2 SUM MEAN STD ;
VAR AGE HEIGHT;
RUN;
Sql使用:
PROC SQL;
select 'Age' as Variable,
sum(age) as Sum format 10.2,
avg(age) as Mean format 10.2,
std(age) as std format 10.2 label 'Std Dev'
from sASHELP.CLASS
union
select 'Height' as Variable,
sum(height) as Sum format 10.2,
avg(height) as Mean format 10.2,
std(height) as std format 10.2 label 'Std Dev'
from sASHELP.CLASS;
QUIT;
4、运用group by和class进行分组处理
(1)means
PROC MEANS DATA=SASHELP.CLASS NONOBS MAXDEC=2 SUM MEAN STD ;
CLASS sex;
VAR AGE HEIGHT;
RUN;
(2)sql
PROC SQL;
select sex as Sex,
'Age' as Variable,
sum(age) as Sum format 10.2,
avg(age) as Mean format 10.2,
std(age) as std format 10.2 label 'Std Dev'
from SASHELP.CLASS
Group by sex
select sex as Sex,
'Height' as Variable,
sum(height) as Sum format 10.2,
avg(height) as Mean format 10.2,
std(height) as std format 10.2 label 'Std Dev'
from sASHELP.CLASS
group by sex;
quit;
5、在means过程步中,运用class结合type语句分组处理
默认情况下,分组是发生在class变量之后。通过使用type语句可以选择所有或者单独的class变量。
data grade;
input Name $ Gender $ Status $ Year $ Section $ Score FinalGrade @@;
datalines;
Abbott F 2 97 A 90 87 Branford M 1 97 A 92 97
David M 3 99 c 87 96 Crandell M 2 98 B 81 71
Dennison M 1 97 A 85 72 Edgar F 1 98 B 89 80
Nancy F 3 99 B 79 88 Faust M 1 97 B 78 73
Greeley F 2 97 A 82 91 Hart F 1 98 B 84 80
Mick M 3 98 c 77 91 Isley M 2 99 A 88 86
Jasper M 1 97 B 91 93 Ray M 3 97 B 76 90
Mary F 3 97 C 75 79 Nick M 2 98 B 85 89
Billy M 2 98 C 77 83 Taylor F 3 98 A 86 81
Roy M 3 98 B 92 84 Mandy F 3 97 C 88 87
;
run;
proc means data=grade nonobs n mean sum maxdec=2;
class Status Year;
var Score;
types () status*year;
run;
sql处理
proc sql;
select count(*) as N,
avg(score) as Mean format 10.2 ,
std(score) as Sum format 10.2
from grade;
quit;
proc sql;
select status,year,
count(*) as N,
avg(score) as Mean format 10.2 ,
sum(score) as Sum format 10.2
from grade
group by status, year;
quit;
6、剔除不在classdata中的分析变量
data statyear;
input Status $ Year $ @@;
datalines;
1 97 1 98 2 97 2 98 3 97 3 99
;
run;
proc means data=grade nonobs nway n mean sum maxdec=0
classdata=statyear exclusive;
class status year;
var score;
proc sql;
select status,year,
count(*) as N,
mean(score) as Mean format 10.2 ,
sum(score) as Sum format 10.2
from grade
where status||year in (select status||year from statyear)
group by status, year;
quit;
7、从proc means和proc sql中创建输出数据集
在proc means中通过output产生输出数据集
Proc means data=grade nonobs maxdec=2 noprint;
Class status year;
Var score finalgrade;
Output out=sumgrade max=scr_max grade_max mean(score finalgrade) = scr_mean
grade_mean sum=;
format scr_mean grade_mean 10.1;
Run;
proc print data=sumgrade;run;
这样可以产生两个自动变量_type_,_FREQ_。
使用sql处理
PROC SQL;
Create table sumgrade as
Select ' ' as Status, ' ' as Year ,0 as _TYPE_ , count(*) as _FREQ_,
Max(score) as scr_max, max(finalgrade) as grade_mean ,
avg(score) as scr_max format 10.1,
avg(finalgrade) as grade_mean format 10.1,
sum(score) as score, sum(finalgrade) as finalgrade
from grade
union
Select ' ' as Status, Year ,1 as _TYPE_ , count(*) as _FREQ_,
Max(score) as scr_max, max(finalgrade) as grade_mean ,
avg(score) as scr_max format 10.1 ,
avg(finalgrade) as grade_mean format 10.1,
sum(score) as score, sum(finalgrade) as finalgrade
from grade
group by year
union
Select Status, ' ' as Year ,2 as _TYPE_ , count(*) as _FREQ_,
Max(score) as scr_max, max(finalgrade) as grade_mean ,
avg(score) as scr_max format 10.1 ,
avg(finalgrade) as grade_mean format 10.1,
sum(score) as score, sum(finalgrade) as finalgrade
from grade
group by status
union
Select Status, Year ,3 as _TYPE_ , count(*) as _FREQ_,
Max(score) as scr_max, max(finalgrade) as grade_mean ,
avg(score) as scr_max format 10.1 ,
avg(finalgrade) as grade_mean format 10.1,
sum(score) as score, sum(finalgrade) as finalgrade
from grade
group by status,year
order by _TYPE_ ;
quit;
proc print data=sumgrade; run;
8、sql和means比较总结

加载中…