A Complete Guide to Handling Extreme Values (Outlier Treatment)
(2014-06-16 01:27:06)
Handling Outliers
SUMMARY:
* 2.5.1 The impact of outliers
* 2.5.2 Finding outliers
* 2.5.3 Handling outliers
*   2.5.3.1 Deletion
*   2.5.3.2 Log transformation
*   2.5.3.3 Winsorizing
*   2.5.3.4 Truncation
* = Commands used in this section =
* ================================================
* histogram, winsor, hadimvo, egen outside()
* qr, adjacent, fsreg, lv
* ================================================


*--------------------
*-2.5.1 The impact of outliers

*-Example: how outliers affect regression results

sysuse auto, clear
histogram price
count if price>13000

reg price weight length foreign
est store r1
reg price weight length foreign if price<13000
est store r2

* esttab is part of the user-written estout package (ssc install estout)
esttab r1 r2, mtitle("with" "without")

*-Conclusion: only 4 observations are outliers, yet they change the regression results substantially


*-------------------
*-2.5.2 Finding outliers

* -------------
* Basic concepts
* -------------
*
* The values at the 25th, 50th, and 75th percentiles are the 1st, 2nd, and 3rd quartiles
* Interquartile range: iqr = p75 - p25
* Upper fence = p75 + 1.5*iqr; the upper adjacent value is the largest observation at or below it
* Lower fence = p25 - 1.5*iqr; the lower adjacent value is the smallest observation at or above it
*------------------------------------------------
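
*-A minimal sketch of the definitions above, computed by hand from the
* results stored by -summarize, detail-. The scalar names f_iqr, f_lo and
* f_hi are illustrative assumptions, not part of the original example.
sysuse auto, clear
quietly summarize price, detail
scalar f_iqr = r(p75) - r(p25)          // interquartile range
scalar f_hi  = r(p75) + 1.5*scalar(f_iqr)   // upper fence
scalar f_lo  = r(p25) - 1.5*scalar(f_iqr)   // lower fence
display "IQR = " scalar(f_iqr) "  lower fence = " scalar(f_lo) "  upper fence = " scalar(f_hi)
* number of non-missing prices lying outside the fences
count if (price < scalar(f_lo) | price > scalar(f_hi)) & !missing(price)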

*-The -adjacent- command (user-written; must be installed first)
sysuse auto, clear
adjacent price
adjacent price, by(foreign)

*-The outside() function provided by egenmore (ssc install egenmore)
egen out = outside(price)
egen out2 = outside(price), factor(2)
egen outby= outside(price), by(foreign) factor(2)
list price out*
keep if outby==. // keep only non-outliers, i.e. drop the outliers

*-Box plots
help graph box
graph box price
graph box price, by(foreign)
graph box weight, by(foreign)
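
*-A possible complement (sketch only): the official -lv- command, listed among
* this section's commands but not demonstrated above. Its letter-value display
* reports inner and outer fences together with the number of values beyond them.
* The data are reloaded here because the -keep- above removed observations.
sysuse auto, clear
lv price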



*---------------------
*-2.5.3 Handling outliers

*-2.5.3.1 Deletion

sysuse auto, clear
adjacent price, by(foreign)
* the cutoffs below are taken from the -adjacent- output for each group
drop if (price>8814&foreign==0) | (price>9735&foreign==1)

*-or // requires the -egenmore- package to be installed first
sysuse auto, clear
egen outby = outside(price), by(foreign) factor(2)
drop if outby != .


*-2.5.3.2 Log transformation

sysuse nlsw88, clear
gen ln_wage = ln(wage)

twoway (histogram wage,color(green)) ///
       (histogram ln_wage,color(yellow))

sum wage ln_wage, d

graph box wage
graph box ln_wage
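
*-A minimal sketch (not part of the original example): compare skewness before
* and after the log transform, and check for values where ln() is undefined.
* The names ln_wage_chk and sk_raw are illustrative assumptions.
sysuse nlsw88, clear
count if wage <= 0                      // ln() returns missing for these observations
gen ln_wage_chk = ln(wage)
quietly summarize wage, detail
scalar sk_raw = r(skewness)             // save it before the next -summarize- overwrites r()
quietly summarize ln_wage_chk, detail
display "skewness: raw = " scalar(sk_raw) "  log = " r(skewness)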


*-2.5.3.3 Winsorizing

sysuse nlsw88.dta, clear
histogram wage

*-Two-sided winsorizing (-winsor- is user-written: ssc install winsor)
winsor wage, gen(wage_w2) p(0.025)

*-Plot
twoway (histogram wage,color(green)) ///
       (histogram wage_w2,color(yellow)), ///
       legend(label(1 "wage") label(2 "wage_winsor2"))

*-One-sided winsorizing (upper tail only)
winsor wage, gen(wage_h) p(0.025) highonly
*-Plot
twoway (histogram wage,color(green)) ///
       (histogram wage_h,color(yellow)), ///
       legend(label(1 "wage") label(2 "wage_winsorH"))


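*-If the -winsor- command cannot be downloaded, winsorize by hand:

_pctile wage, percentile(1 99)
scalar w_p1  = r(r1)     // save the cutoffs, since later commands may overwrite r()
scalar w_p99 = r(r2)
gen wage_m = wage        // work on a copy so the original wage stays available
replace wage_m = scalar(w_p1)  if wage < scalar(w_p1)
replace wage_m = scalar(w_p99) if wage > scalar(w_p99) & !missing(wage)  // missing counts as larger than any number

*-Alternatively, use the clip() function (see Section 2.1.2.6, line 385)
gen wage_w = clip(wage, scalar(w_p1), scalar(w_p99))
sum wage wage_m wage_w, detail


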
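*-2.5.3.4 Truncation (trimming)

sysuse nlsw88, clear
_pctile wage, percentile(1 99)
return list
scalar w_p1  = r(r1)     // save the cutoffs first, since later commands may overwrite r()
scalar w_p99 = r(r2)
drop if wage < scalar(w_p1)                     // drop observations below the 1st percentile
drop if wage > scalar(w_p99) & !missing(wage)   // drop observations above the 99th percentile

*-Notes:
* (1) Plot a histogram first, then use the shape of the distribution to choose
*     left, right, or two-sided truncation
* (2) Unlike ln() and winsorizing, truncation loses observations,
*     but for large samples it is a relatively "clean" approach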
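
*-A minimal sketch of notes (1) and (2): right-side truncation only, reporting
* how many observations are lost. The scalar names w_cut and n_before are
* illustrative assumptions, not part of the original example.
sysuse nlsw88, clear
_pctile wage, percentile(99)
scalar w_cut = r(r1)                    // 99th percentile, saved before r() can change
count
scalar n_before = r(N)                  // sample size before truncation
drop if wage > scalar(w_cut) & !missing(wage)
count
display "observations lost: " scalar(n_before) - r(N)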