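Handling missing values in Stata
(2016-01-20 14:22:07) Category: 04 Stata data processing
[Problem]
Sometimes, when cleaning a dataset or receiving one from someone else, you want to check how many values are missing in each variable.
[Commands]
nmissing, npresent
[Example]
ssc install nmissing
use yourdata, clear
nmissing
nmissing var1
nmissing, min(10)
npresent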
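nmissing and npresent are user-written commands installed from SSC. If installing packages is not an option, a rough equivalent can be sketched with official commands. The example below assumes the auto dataset shipped with Stata; it is only an illustration, not a replacement for the commands above.

```stata
* Load the example dataset shipped with Stata
sysuse auto, clear

* Tabulate missing-value counts for every variable that has any
misstable summarize

* Count missing values in a single variable
count if missing(rep78)
```

misstable summarize lists only variables with at least one missing value, which is similar in spirit to running nmissing without arguments.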
How can I replace missing values with previous or following nonmissing values or within sequences?
Title: Replacing missing values
Author: Nicholas J. Cox, Durham University, UK
Date: August 2000; updated January 2012
1. The problems
Users often want to replace missing values by neighboring nonmissing values, particularly when observations occur in some definite order, often (but not always) a time order. Typically, this occurs when values of some variable should be identical within blocks of observations, but, for some reason, values are explicitly nonmissing within the dataset only for certain observations, most often the first. So, there is a wish to copy values within blocks of observations.
Alternatively, users often want to replace missing values in a sequence, usually in a time sequence. These problems can be solved with similar methods.
A different situation, not addressed directly in this FAQ, is when values of some time-varying variable are known only for certain observations. There is then a need for imputation or interpolation between known values. Copying the last value forward is unlikely to be a good method of interpolation unless, as just stated, it is known that values remained constant at a stated level until the next stated level. Either way, users applying the methods described here for imputation or interpolation take on the responsibility for what they do.
2. Without tsset: copying nonmissing values
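Let us first look at the case where you have not tsset your data. Suppose that the observations have been put in the correct sort order, say, by typing
. sort time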
If missing values occurred singly, then they could be replaced by the previous value
. replace myvar = myvar[_n-1] if missing(myvar)
or by the following value
. replace myvar = myvar[_n+1] if missing(myvar)
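Here the subscript notation used is that _n refers to the current observation, _n-1 to the previous observation, and _n+1 to the following observation, given the current sort order. The function missing(myvar) catches both numeric and string missing values. If myvar is numeric, you could instead type
. replace myvar = myvar[_n+1] if myvar >= .
because numeric missing values are treated as larger than any nonmissing number. If myvar is a string variable, then
. replace myvar = myvar[_n+1] if myvar == ""
would be correct syntax, not the previous command, because the empty string "" is string missing.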
3. Copying previous values downwards: the cascade effect
Missing values may occur in blocks of two or more. Suppose you want to replace missings by the previous nonmissing value, whenever it occurred, so that given
_n   myvar
 1      42
 2       .
 3       .
 4      56
 5      67
 6      78
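you want to replace not only myvar[2] but also myvar[3] with 42. Typing
. replace myvar = 42 in 2/3
is an interactive solution, but, for larger datasets, you need a more systematic way of proceeding. To get this, it helps to know that replace works through the observations in the current sort order, so that by the time observation 3 is processed, myvar[2] has already been replaced by 42. This cascade effect means that
. replace myvar = myvar[_n-1] if myvar >= .
achieves this purpose. Nothing can be done this way about any missing values at the start of the series, because they have no previous value to copy.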
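A minimal sketch of the cascade on the six-observation example, written as a do-file. The time variable and the use of input to type in the data are assumptions made here for illustration.

```stata
* Recreate the small example; time is an assumed ordering variable
clear
input time myvar
1 42
2 .
3 .
4 56
5 67
6 78
end

sort time

* Each observation is processed in sort order, so observation 3
* sees the value already copied into observation 2
replace myvar = myvar[_n-1] if missing(myvar)

list, noobs
```

After this, observations 2 and 3 both hold 42, while the nonmissing values are unchanged.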
What if you want to use the previous value only and do not want this cascade effect? You need to copy the variable and replace from that:
. gen mycopy = myvar
. replace myvar = mycopy[_n-1] if myvar >= .
No replacement is being made in mycopy, so there is no cascade effect.
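The same six observations can be used to see the difference. This sketch assumes the same illustrative time variable as before; only the first missing value in each block is filled.

```stata
clear
input time myvar
1 42
2 .
3 .
4 56
5 67
6 78
end

sort time

* Copy from an untouched duplicate, so replaced values are never reused
gen mycopy = myvar
replace myvar = mycopy[_n-1] if missing(myvar)
drop mycopy

list, noobs
```

Observation 2 becomes 42, but observation 3 stays missing because mycopy[2] is still missing.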
4. Copying following values upwards
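The opposite case is replacement by following values, but, because replace works through the observations in the current sort order, the cascade runs only downwards. The easiest solution is to reverse the sort order first:
. gsort -time
. replace myvar = myvar[_n-1] if myvar >= .
gsort with a minus sign sorts time in descending order, so the previous observation is now the one that follows in time. Note that, in the original sort order,
. replace myvar = myvar[_n+1] if myvar >= .
does not produce a cascade effect.
Once again, nothing can be done about any missing values at the end of the series (placed at the beginning after the gsort). After replacement, you will probably want to restore the original sort order:
. sort time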
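A short sketch of filling upwards by reversing the sort order. The data values are made up for illustration, with a trailing missing value to show what is left untouched.

```stata
clear
input time myvar
1 .
2 .
3 56
4 .
5 78
6 .
end

* Reverse the order so that "previous" means "later in time"
gsort -time
replace myvar = myvar[_n-1] if missing(myvar)

* Restore the original order
sort time
list, noobs
```

Times 1 to 3 end up with 56 and times 4 and 5 with 78. The value at time 6 remains missing because no later value exists.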
5. Complications: several variables and panel structure
Two common complications are
- You want to do this with several variables: use foreach. sort or gsort once, replace all variables using foreach, and, if necessary, sort back again.
- You have panel data, so the appropriate replacement is a neighboring nonmissing value for each individual in the panel.
Suppose that individuals are identified by id. Then type
. by id (time), sort: replace myvar = myvar[_n-1] if myvar >= .
or
. gsort id -time
. quietly by id: replace myvar = myvar[_n-1] if myvar >= .
. sort id time
The key to many data management problems with panel data lies in following sort with by varlist:, so that each command is carried out separately within each panel. Under by:, _n counts observations within the current group, so values are never copied from one individual to the next.
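Both complications can be combined in one loop. In this sketch the variables x and y and the data values are hypothetical; the point is only that the copy stops at panel boundaries.

```stata
clear
input id time x y
1 1 10 .
1 2 .  5
1 3 .  .
2 1 .  7
2 2 20 .
end

* Sort once, then fill each variable downwards within each id
sort id time
foreach v of varlist x y {
    quietly by id: replace `v' = `v'[_n-1] if missing(`v')
}

list, noobs sepby(id)
```

The first x for id 2 stays missing: it is the first observation of its panel, so nothing is carried over from id 1.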
6. With tsset
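If you have tsset your data, say, by typing
. tsset time
then
. replace myvar = L.myvar if myvar >= .
has the effect of copying in cascade, whereas
. replace myvar = F.myvar if myvar >= .
has no such effect. The value of time-series operators is clearest with panel data: after tsset id time, L. and F. are evaluated within each panel automatically.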
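One point worth checking on your own data is that L. refers to the previous time period, not the previous observation. The sketch below uses made-up values with a gap at time 4 to illustrate this.

```stata
clear
input time myvar
1 42
2 .
3 .
5 .
end

* Declare the time variable; Stata notes that the series has a gap
tsset time

* L.myvar is the value one period earlier, if such an observation exists
replace myvar = L.myvar if missing(myvar)

list, noobs
```

Times 2 and 3 are filled with 42 by the cascade, but time 5 stays missing because there is no observation for time 4. With myvar[_n-1] instead of L.myvar, the value would have been copied across the gap.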
7. Missing values in sequences
In some datasets, time variables come with gaps, something like
_n   year
 1      .
 2      .
 3   1990
 4      .
 5      .
 6      .
 7      .
 8   1995
 9      .
10      .
We can use a similar method and rely on cascading:
. replace year = 1988 in 1
. replace year = year[_n-1] + 1 if missing(year)
The difference is simply that each value is one more than the
previous one. If data were once per decade, each value would be 10
more, and so forth. Again missing values at the beginning of a
sequence need special surgery, as shown here.
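The ten-observation example can be run end to end as a do-file. The data are typed in with input here purely for illustration.

```stata
clear
input year
.
.
1990
.
.
.
.
1995
.
.
end

* Seed the first observation by hand, then let the cascade count upwards
replace year = 1988 in 1
replace year = year[_n-1] + 1 if missing(year)

list
```

The known values 1990 and 1995 are left alone, and the result is the unbroken run 1988 to 1997. If the known values were inconsistent with a yearly step, this approach would not detect it, so a check such as assert year == year[_n-1] + 1 in 2/l may be worth adding.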
With