Handling Duplicate Data in SAS Data Sets
Tags: SAS, big data analysis, data analyst, data mining
The test data is shown below:
http://www.cda.cn/uploadfile/image/20170601/20170601070147_56474.png
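The test data is only available as the image above. To run the steps in this post without it, a small stand-in data set can be created. This is a minimal sketch: the values are hypothetical and are not taken from the image, and a numeric ID variable is an assumption. The data set name ID and the variable name ID match the code used throughout the post.

```sas
/* Hypothetical stand-in for the test data (values are NOT from the image). */
/* Assumes a single numeric variable named ID in a data set named ID.       */
data ID;
  input ID;
  datalines;
1
2
2
3
3
3
4
;
run;
```

With this stand-in data, ID values 1 and 4 occur once, while 2 and 3 are duplicated.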
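TARGET 1: Keep only the non-duplicated records / keep only the duplicated records
Method 1: DATA STEP
proc sort data=ID;
  by ID;
run;

/* ID_1: records whose ID occurs exactly once.                          */
/* The IF condition is truncated in the source; inferred from the goal. */
data ID_1;
  set ID;
  by ID;
  if first.ID and last.ID;
run;

/* ID_2: all records whose ID occurs more than once.                    */
/* The IF condition is truncated in the source; inferred from the goal. */
data ID_2;
  set ID;
  by ID;
  if not (first.ID and last.ID);
run;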
http://www.cda.cn/uploadfile/image/20170601/20170601070115_85030.png
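Method 2: PROC SQL
/* id_3: records whose ID occurs exactly once.                            */
/* The HAVING condition is truncated in the source; inferred from the goal. */
proc sql;
  create table id_3 as
  select a.* from ID a,
    (select ID, count(1) as ID_cnt from ID
     group by ID
     having count(1) = 1) b
  where a.ID = b.ID;
quit;

/* id_4: records whose ID occurs more than once (HAVING likewise inferred). */
proc sql;
  create table id_4 as
  select a.* from ID a,
    (select ID, count(1) as ID_cnt from ID
     group by ID
     having count(1) > 1) b
  where a.ID = b.ID;
quit;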
http://www.cda.cn/uploadfile/image/20170601/20170601070049_58461.png
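Method 3: PROC FREQ
/* id_5: one row per ID value that occurs exactly once */
proc freq data=ID noprint;
  table ID / out=id_5 (keep = ID Count where = (Count = 1));
run;

/* id_6: one row per ID value that occurs more than once */
proc freq data=ID noprint;
  table ID / out=id_6 (keep = ID Count where = (Count > 1));
run;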
http://www.cda.cn/uploadfile/image/20170601/20170601070021_63761.png
TARGET 2: De-duplicate the data set
Method 1: PROC SORT
proc sort data=ID nodupkey out=ex1;
  by ID;
run;
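Note: with this test data, NODUP and NODUPKEY produce the same result, but in practice they behave differently. The main differences are:
NODUPKEY removes observations whose BY-variable (key) values duplicate those of an earlier observation, keeping the first observation in each BY group.
NODUP (an alias of NODUPRECS) removes observations that are identical in every variable, but only when the identical observations are adjacent.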
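The difference between the two options only shows up when a data set has variables beyond the key. The following is a minimal sketch using a hypothetical data set `demo` with a hypothetical extra variable `VAL` (neither appears in the original post); it is meant to illustrate the behavior described above, not to reproduce the post's test data.

```sas
/* Hypothetical data: ID 1 has two identical rows plus one row with a different VAL */
data demo;
  input ID VAL;
  datalines;
1 10
1 10
1 20
2 30
;
run;

/* NODUPKEY: keeps the first observation for each BY value,   */
/* so demo_key holds one row per ID.                          */
proc sort data=demo nodupkey out=demo_key;
  by ID;
run;

/* NODUPRECS (alias NODUP): drops an observation only when it is identical  */
/* in every variable to the previous one kept. Sorting by all variables     */
/* makes identical rows adjacent, so only the repeated (1, 10) row is lost. */
proc sort data=demo noduprecs out=demo_rec;
  by ID VAL;
run;
```

Here `demo_key` ends up with two observations and `demo_rec` with three, because the row with ID 1 and VAL 20 shares a key with the other ID 1 rows but is not a complete duplicate of them.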
Method 2: PROC SQL
proc sql;
  create table ex2 as
  select distinct ID from ID;
quit;
Method 3: DATA STEP
proc sort data=ID;
  by ID;
run;

/* Keep the first observation in each ID group */
data ex3;
  set ID;
  by ID;
  if first.ID then output ex3;
run;
