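Data sampling draws a subset of units from the full population under study. The basic requirement is that the units drawn be sufficiently representative of the whole population. The aim is to estimate and infer the characteristics of the population from the analysis of the sampled units. It is an economical and effective working and research method widely used in scientific experiments, quality inspection, and social surveys.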
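1 Simple random sampling:
Every sampling unit has the same probability of being drawn into the sample. How the population is numbered and how units are drawn at random depend on what is being surveyed.
The sample size here is given as a percentage, i.e. the sampling fraction: the number of sampling units in the sample as a proportion of the number of units in the population.
[Screenshot: SAS EM Sampling node]
The code is as follows: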
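data EMDATA.VIEW_4O9 / view=EMDATA.VIEW_4O9;
  set EMSAMPLE.BUYTEST;
run;

* 10% sample: the population has 10000 records, so 1000 are drawn;
data EMDATA.SMPINPHW;
  set EMDATA.VIEW_4O9;
  drop _sample_count_;
  /* _sample_count_ is retained by the sum statement and counts the records selected so far */
  if _sample_count_ < 1000 then do;
    /* select with probability (records still needed) / (records left, counting this one) */
    if ranuni(12345)*(10001 - _N_) <= (1000 - _sample_count_) then do;
      _sample_count_ + 1;
      output;
    end;
  end;
run;
quit;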
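For comparison, the same kind of draw can be sketched with PROC SURVEYSELECT from SAS/STAT. This is not the code the Sampling node generates, only an alternative under the assumption that SAS/STAT is licensed. The output name WORK.SRS_SAMPLE is hypothetical, and the records chosen will differ from the data step above even with the same seed.

```sas
/* Sketch: simple random sampling without replacement.        */
/* WORK.SRS_SAMPLE is a hypothetical output data set name.    */
proc surveyselect data=EMSAMPLE.BUYTEST
                  out=WORK.SRS_SAMPLE
                  method=srs       /* equal probability, without replacement */
                  sampsize=1000    /* samprate=0.10 would request the same 10% */
                  seed=12345;
run;
```

Specifying the size as a rate rather than a count keeps the request valid if the input table grows.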
2 Nth sampling:
Let the population size be N and the number of units to draw be n. Nth sampling then takes one unit out of every N/n units. With N = 10000 and n = 1000, that is every 10th record.
[Screenshot: SAS EM Sampling node]
The code is as follows:
* Draw a random integer start between 0 and 9, for example 3;
data _null_;
  nthstart = floor(ranuni(12345)*10);
  call symput('nthstart', put(nthstart, best12.));
run;
%put &nthstart;

* Output a record when its record number modulo 10 equals the start value, for example 3;
data EMDATA.SMPINPHW;
  set EMDATA.VIEW_4O9;
  if mod(_N_, 10) = &nthstart then output;
run;
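A comparable systematic sample can be sketched with PROC SURVEYSELECT, again assuming SAS/STAT is available. The output name WORK.SYS_SAMPLE is hypothetical. The procedure picks its own random start, so the selected records will generally not match the modulo-based data step above.

```sas
/* Sketch: systematic (every k-th record) sampling.           */
/* WORK.SYS_SAMPLE is a hypothetical output data set name.    */
proc surveyselect data=EMSAMPLE.BUYTEST
                  out=WORK.SYS_SAMPLE
                  method=sys       /* systematic selection with a random start */
                  samprate=0.10    /* 10% rate, i.e. an interval of about 10 */
                  seed=12345;
run;
```

Systematic sampling depends on the record order, so it is worth checking that the input is not sorted in a way that lines up with the interval.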
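3 Stratified random sampling:
Random or sequential samples are drawn separately from each stratum or segment.
[Screenshot: SAS EM Sampling node]
Next, we set the options for stratified sampling:
Here we usually stratify on the target variable, for example to oversample so that bad cases make up a larger share of the sample. As shown below, we set the status of the variable respond to use, which means the sample is stratified by respond.
[Screenshot: SAS EM Sampling node]
Stratified sampling comes in four main variants:
3.1 Proportional allocation:
When the strata differ in size, the number of sampling units per stratum is set in proportion to the share of the population that the stratum holds. If all strata are the same size, proportional allocation gives the same result as equal allocation.
The sample keeps the original ratio. For example, if the original ratio of good to bad cases is 10:1, the ratio in the sample is also 10:1.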
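As a worked example with this data set (9233 records with respond=0 and 767 with respond=1, total sample of 1000), proportional allocation gives the stratum sample sizes below. The symbols n, N, N_h and n_h are notation introduced here for illustration, and rounding to the nearest integer is an assumption that happens to reproduce the counts used in the code.

```latex
n_h = n \cdot \frac{N_h}{N}, \qquad
n_0 = 1000 \cdot \frac{9233}{10000} = 923.3 \approx 923, \qquad
n_1 = 1000 \cdot \frac{767}{10000} = 76.7 \approx 77
```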
[Screenshot: SAS EM Sampling node]
The code is as follows:
proc freq data=EMDATA.VIEW_4O9 noprint;
  format RESPOND BEST12.;
  table RESPOND / out=EMPROJ.FRQRJPB2(rename=(count=_npop_ percent=_pctpop_)) missing;
run;
quit;

proc sort data=EMPROJ.FRQRJPB2 out=EMPROJ.FRQRJPB2;
  by descending _npop_;
run;

* Draw 923 records with RESPOND=0 and 77 with RESPOND=1;
data EMDATA.SMPINPHW;
  set EMDATA.VIEW_4O9;
  drop _n000001 _s000001 _n000002 _s000002;
  length _SFormat1 $200;
  drop _SFormat1;
  _SFormat1 = trim(left(put(RESPOND, BEST12.)));
  if _SFormat1 = '0' then do;
    _n000001 + 1;  /* records seen so far in this stratum (9233 in total) */
    if _s000001 < 923 then do;
      /* 9234 - _n000001 = records left in the stratum, counting this one */
      if ranuni(12345)*(9234 - _n000001) <= (923 - _s000001) then do;
        _s000001 + 1;
        output;
      end;
    end;
  end;
  else if _SFormat1 = '1' then do;
    _n000002 + 1;  /* records seen so far in this stratum (767 in total) */
    if _s000002 < 77 then do;
      /* 768 - _n000002 = records left in the stratum, counting this one */
      if ranuni(12345)*(768 - _n000002) <= (77 - _s000002) then do;
        _s000002 + 1;
        output;
      end;
    end;
  end;
run;
3.2 Equal size:
After sampling, the good and bad strata have the same size, i.e. a good-to-bad ratio of 1:1. In this example 1000 units are drawn, so 500 good and 500 bad cases.
[Screenshot: SAS EM Sampling node]
The code is as follows:
proc freq data=EMDATA.VIEW_4O9 noprint;
  format RESPOND BEST12.;
  table RESPOND / out=EMPROJ.FRQRJPB2(rename=(count=_npop_ percent=_pctpop_)) missing;
run;
quit;

proc sort data=EMPROJ.FRQRJPB2 out=EMPROJ.FRQRJPB2;
  by descending _npop_;
run;

* Draw 500 records with RESPOND=0 and 500 with RESPOND=1;
data EMDATA.SMPINPHW;
  set EMDATA.VIEW_4O9;
  drop _n000001 _s000001 _n000002 _s000002;
  length _SFormat1 $200;
  drop _SFormat1;
  _SFormat1 = trim(left(put(RESPOND, BEST12.)));
  if _SFormat1 = '0' then do;
    _n000001 + 1;  /* records seen so far in this stratum (9233 in total) */
    if _s000001 < 500 then do;
      /* 9234 - _n000001 = records left in the stratum, counting this one */
      if ranuni(12345)*(9234 - _n000001) <= (500 - _s000001) then do;
        _s000001 + 1;
        output;
      end;
    end;
  end;
  else if _SFormat1 = '1' then do;
    _n000002 + 1;  /* records seen so far in this stratum (767 in total) */
    if _s000002 < 500 then do;
      /* 768 - _n000002 = records left in the stratum, counting this one */
      if ranuni(12345)*(768 - _n000002) <= (500 - _s000002) then do;
        _s000002 + 1;
        output;
      end;
    end;
  end;
run;
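The equal-size draw can also be sketched with PROC SURVEYSELECT and a STRATA statement, assuming SAS/STAT is available. This is an alternative for illustration, not the node's generated code. WORK.BUYTEST_SORTED and WORK.EQUAL_SAMPLE are hypothetical names. The input has to be sorted by the stratum variable, and the sizes in SAMPSIZE= are read in the sorted order of the strata (RESPOND=0 first, then RESPOND=1).

```sas
/* Sketch: 500 records from each RESPOND stratum.                        */
/* WORK.BUYTEST_SORTED and WORK.EQUAL_SAMPLE are hypothetical names.     */
proc sort data=EMSAMPLE.BUYTEST out=WORK.BUYTEST_SORTED;
  by RESPOND;
run;

proc surveyselect data=WORK.BUYTEST_SORTED
                  out=WORK.EQUAL_SAMPLE
                  method=srs
                  sampsize=(500 500)   /* one size per stratum, in sorted order */
                  seed=12345;
  strata RESPOND;
run;
```

Replacing the list with other counts, for example (923 77), would give the proportional split from section 3.1.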
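3.3 Optimal allocation:
The allocation weighs the size of each stratum, its variability, and the cost of drawing one unit, in order to find a plan with small sampling error and low cost.
Here we first compute the standard deviation of the AGE variable within the good and bad strata:
| RESPOND | Stratum Size | Std Deviation of AGE | Stratum Size * Std Deviation of AGE |
| 0 | 9233 | 10.06500995 | 92930.24 |
| 1 | 767 | 10.27857214 | 7883.665 |
The number of records with respond=0 is then: 1000*92930.24/(92930.24+7883.665) = 922
The number of records with respond=1 is: 1000*7883.665/(92930.24+7883.665) = 78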
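The calculation above matches what is usually called Neyman allocation, which is optimal allocation with the per-unit cost taken as equal across strata (the example does not use a cost term). Written out, with N_h the stratum size, S_h the standard deviation of AGE in stratum h, and n the total sample size, all of which are notation introduced here for illustration:

```latex
n_h = n \cdot \frac{N_h S_h}{\sum_k N_k S_k}, \qquad
n_0 = 1000 \cdot \frac{92930.24}{92930.24 + 7883.665} \approx 921.8 \rightarrow 922, \qquad
n_1 = 1000 \cdot \frac{7883.665}{92930.24 + 7883.665} \approx 78.2 \rightarrow 78
```

Because the two standard deviations are almost identical, the result is very close to the proportional split of 923 and 77.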
[Screenshot: SAS EM Sampling node]
The code is as follows:
proc summary data=EMDATA.VIEW_4O9 nway missing std;
  format RESPOND BEST12.;
  class RESPOND;
  var AGE;
  output out=EMPROJ.FRQRJPB2(drop=_type_ rename=(_freq_=_npop_)) std=_std_;
run;
quit;

proc sort data=EMPROJ.FRQRJPB2 out=EMPROJ.FRQRJPB2;
  by descending _npop_;
run;
quit;

data EMPROJ.FRQRJPB2;
  set EMPROJ.FRQRJPB2;
  _pctpop_ = .;
run;
quit;

* Draw 922 records with RESPOND=0 and 78 with RESPOND=1;
data EMDATA.SMPINPHW;
  set EMDATA.VIEW_4O9;
  drop _n000001 _s000001 _n000002 _s000002;
  length _SFormat1 $200;
  drop _SFormat1;
  _SFormat1 = trim(left(put(RESPOND, BEST12.)));
  if _SFormat1 = '0' then do;
    _n000001 + 1;  /* records seen so far in this stratum (9233 in total) */
    if _s000001 < 922 then do;
      /* 9234 - _n000001 = records left in the stratum, counting this one */
      if ranuni(12345)*(9234 - _n000001) <= (922 - _s000001) then do;
        _s000001 + 1;
        output;
      end;
    end;
  end;
  else if _SFormat1 = '1' then do;
    _n000002 + 1;  /* records seen so far in this stratum (767 in total) */
    if _s000002 < 78 then do;
      /* 768 - _n000002 = records left in the stratum, counting this one */
      if ranuni(12345)*(768 - _n000002) <= (78 - _s000002) then do;
        _s000002 + 1;
        output;
      end;
    end;
  end;
run;
3.4 User-defined stratified sampling:
Metadata sampling:
3.4.1 Sample proportion: set the percentage share of good and bad cases in the sample.
[Screenshots: SAS EM Sampling node]
The code is as follows:
* This run is a Metadata Sample with Sample Proportion 80:20;
proc freq data=EMDATA.VIEW_4O9 noprint;
  format RESPOND BEST12.;
  table RESPOND / out=EMPROJ.FRQRJPB2(rename=(count=_npop_ percent=_pctpop_)) missing;
run;
quit;

proc sort data=EMPROJ.FRQRJPB2 out=EMPROJ.FRQRJPB2;
  by descending _npop_;
run;

* Sample Proportion 80:20, i.e. 800 records with RESPOND=0 and 200 with RESPOND=1;
data EMDATA.SMPINPHW(label="Sample of EMDATA.VIEW_4O9.");
  set EMDATA.VIEW_4O9;
  drop _n000001 _s000001 _n000002 _s000002;
  length _SFormat1 $200;
  drop _SFormat1;
  _SFormat1 = trim(left(put(RESPOND, BEST12.)));
  if _SFormat1 = '0' then do;
    _n000001 + 1;  /* records seen so far in this stratum (9233 in total) */
    if _s000001 < 800 then do;
      /* 9234 - _n000001 = records left in the stratum, counting this one */
      if ranuni(12345)*(9234 - _n000001) <= (800 - _s000001) then do;
        _s000001 + 1;
        output;
      end;
    end;
  end;
  else if _SFormat1 = '1' then do;
    _n000002 + 1;  /* records seen so far in this stratum (767 in total) */
    if _s000002 < 200 then do;
      /* 768 - _n000002 = records left in the stratum, counting this one */
      if ranuni(12345)*(768 - _n000002) <= (200 - _s000002) then do;
        _s000002 + 1;
        output;
      end;
    end;
  end;
run;
3.4.2 Strata proportion:
Set what share of the original good cases and of the original bad cases is drawn.
[Screenshot: SAS EM Sampling node]
The code is as follows:
* This run is a Metadata Sample with Strata Proportion 20:20, the share drawn from each original stratum;
proc freq data=EMDATA.VIEW_4O9 noprint;
  format RESPOND BEST12.;
  table RESPOND / out=EMPROJ.FRQRJPB2(rename=(count=_npop_ percent=_pctpop_)) missing;
run;
quit;

proc sort data=EMPROJ.FRQRJPB2 out=EMPROJ.FRQRJPB2;
  by descending _npop_;
run;

* RESPOND=0 gets 20% of 9233 (1846 records) and RESPOND=1 gets 20% of 767 (153 records);
data EMDATA.SMPINPHW(label="Sample of EMDATA.VIEW_4O9.");
  set EMDATA.VIEW_4O9;
  drop _n000001 _s000001 _n000002 _s000002;
  length _SFormat1 $200;
  drop _SFormat1;
  _SFormat1 = trim(left(put(RESPOND, BEST12.)));
  if _SFormat1 = '0' then do;
    _n000001 + 1;  /* records seen so far in this stratum (9233 in total) */
    if _s000001 < 1846 then do;
      /* 9234 - _n000001 = records left in the stratum, counting this one */
      if ranuni(12345)*(9234 - _n000001) <= (1846 - _s000001) then do;
        _s000001 + 1;
        output;
      end;
    end;
  end;
  else if _SFormat1 = '1' then do;
    _n000002 + 1;  /* records seen so far in this stratum (767 in total) */
    if _s000002 < 153 then do;
      /* 768 - _n000002 = records left in the stratum, counting this one */
      if ranuni(12345)*(768 - _n000002) <= (153 - _s000002) then do;
        _s000002 + 1;
        output;
      end;
    end;
  end;
run;
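The stratum counts in this variant follow from applying the 20% rate to each stratum separately. The counts 1846 and 153 are consistent with truncating the fractional part, which is inferred here from the numbers in the code rather than from any documented rule. N_h, p_h and n_h are notation introduced for illustration.

```latex
n_h = \lfloor p_h N_h \rfloor, \qquad
n_0 = \lfloor 0.20 \times 9233 \rfloor = \lfloor 1846.6 \rfloor = 1846, \qquad
n_1 = \lfloor 0.20 \times 767 \rfloor = \lfloor 153.4 \rfloor = 153
```

The total sample is therefore 1999 records, and the good-to-bad ratio stays the same as in the population.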
4 FIRSTN:
Take the first N records directly:
[Screenshot: SAS EM Sampling node]
The code is as follows:
data EMDATA.SMPINPHW(label="Sample of EMDATA.VIEW_4O9.");
  set EMDATA.VIEW_4O9;
  if _N_ = 1001 then stop;
  output;
run;
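The same first-N selection can be sketched more compactly with the OBS= data set option, which stops reading after the given record. This is an equivalent shown for illustration, not the node's generated code, and WORK.FIRSTN_SAMPLE is a hypothetical output name.

```sas
/* Sketch: keep only the first 1000 records of the input.       */
/* WORK.FIRSTN_SAMPLE is a hypothetical output data set name.   */
data WORK.FIRSTN_SAMPLE;
  set EMSAMPLE.BUYTEST(obs=1000);
run;
```

As with any first-N sample, the result is only representative if the input order is unrelated to the variables of interest.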
5 Cluster sampling:
Cluster sampling, also called group sampling, is a type of probability sampling. The population is divided into clusters according to some criterion, a few clusters are then selected at random, and finally units are drawn from within each selected cluster for study.
[Screenshot: SAS EM Sampling node]
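The post gives no code for this method, so the following is only a rough sketch of the idea and not what the Sampling node generates. It assumes SAS/STAT is available and that the data has a cluster identifier. The variable name CLUSTER_ID, the choice of 3 clusters, and the WORK data set names are all hypothetical. The sketch keeps every record in the chosen clusters; drawing a further sample within each cluster would be an additional step.

```sas
/* Sketch: one-stage cluster sampling.                                    */
/* CLUSTER_ID is a hypothetical variable; replace it with a real one.     */
%let cluster_var = CLUSTER_ID;

/* 1. List each cluster once. */
proc sort data=EMSAMPLE.BUYTEST(keep=&cluster_var)
          out=WORK.CLUSTERS nodupkey;
  by &cluster_var;
run;

/* 2. Pick 3 clusters at random (3 is an arbitrary illustrative count). */
proc surveyselect data=WORK.CLUSTERS
                  out=WORK.PICKED_CLUSTERS
                  method=srs
                  sampsize=3
                  seed=12345;
run;

/* 3. Keep all records that belong to the picked clusters. */
proc sql;
  create table WORK.CLUSTER_SAMPLE as
  select a.*
  from EMSAMPLE.BUYTEST as a
  where a.&cluster_var in (select &cluster_var from WORK.PICKED_CLUSTERS);
quit;
```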