
[Repost] Stata Commands (1): egen

(2015-09-10 07:21:45)

Original post: Stata Commands (1): egen. Author: 徐鑫
Stata Commands (1): egen
1.1 sum() total() pc() rowtotal()

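The example below is from the Stata 11 manual. It shows that sum() and total() are different functions: sum() accumulates a running total observation by observation, whereas total() returns the overall sum. Note, however, that the comparison here is between Stata's sum() (used with generate) and egen's total(). If sum() is used with egen, it behaves the same as total().

Distinguish carefully between Stata's sum() function and egen's total() function. Stata's sum() function creates the running sum, whereas egen's total() function creates a constant equal to the overall sum.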

For example:

clear
set obs 5
gen a=_n
gen sum1=sum(a)
egen sum2=total(a)
list

The output is:
. do "C:\DOCUME~1\ADMINI~1\LOCALS~1\Temp\STD00000000.tmp"

. clear

. set obs 5
obs was 0, now 5

. gen a=_n

. gen sum1=sum(a)

. egen sum2=total(a)

.
end of do-file

. list

     +-----------------+
     | a   sum1   sum2 |
     |-----------------|
  1. | 1      1     15 |
  2. | 2      3     15 |
  3. | 3      6     15 |
  4. | 4     10     15 |
  5. | 5     15     15 |
     +-----------------+




Now modify the example slightly, changing gen to egen in the fourth line of code:

clear
set obs 5
gen a=_n
egen sum1=sum(a)
egen sum2=total(a)
list

The output is:

. do "C:\DOCUME~1\ADMINI~1\LOCALS~1\Temp\STD00000000.tmp"

. clear

. set obs 5
obs was 0, now 5

. gen a=_n

. egen sum1=sum(a)

. egen sum2=total(a)

.
end of do-file

. list

     +-----------------+
     | a   sum1   sum2 |
     |-----------------|
  1. | 1     15     15 |
  2. | 2     15     15 |
  3. | 3     15     15 |
  4. | 4     15     15 |
  5. | 5     15     15 |
     +-----------------+

.
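An application of sum() and total():
Yesterday I saw a request for help on the Stata board of the 人大经济论坛 forum. The link is:
http://bbs.pinggu.org/forum.php?mod=viewthread&tid=1331823&page=1#pid11686881

Posted 2012-1-30 22:10:52
Dear experts: how can I use Stata commands to compute, for the four groups below, each company's share of sales within its own industry? For industry 1, for example, first compute the total sales of the four companies in the industry, then compute the share of A1, A2, ... in that total. Many thanks!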

Company   Sales   Industry
A1        27.72   1
A2        26.37   1
A3        24.79   1
A4        18.69   1
B1        17.48   2
B2        17.04   2
B3        10.87   2
B4         6.68   2
C1         9.06   3
C2         6.8    3
C3         8.85   3
C4         9.43   3
D1        11.48   4
D2        13.96   4
D3        14.19   4
D4        17.93   4


egen's total() function (or egen's sum()) can do this. To avoid problems with Chinese variable names, the industry variable is named industry.
The code is:
by industry ,sort : egen sale_s=total(Sales)
gen ratio=Sales/sale_s

Stata also has a function that computes shares directly, pc(), which is likewise one of egen's functions (fcn).
The improved command is:
by industry, sort: egen ratio=pc(Sales),prop
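
For readers who want to try this, here is a self-contained sketch that types in the forum data and runs both approaches. The variable names company, ratio2, and the tolerance in the check are my own choices, not part of the original answer.

```stata
* Forum data entered by hand (values copied from the question)
clear
input str2 company double Sales byte industry
"A1" 27.72 1
"A2" 26.37 1
"A3" 24.79 1
"A4" 18.69 1
"B1" 17.48 2
"B2" 17.04 2
"B3" 10.87 2
"B4"  6.68 2
"C1"  9.06 3
"C2"  6.8  3
"C3"  8.85 3
"C4"  9.43 3
"D1" 11.48 4
"D2" 13.96 4
"D3" 14.19 4
"D4" 17.93 4
end

* Approach 1: industry total first, then the share
by industry, sort: egen sale_s = total(Sales)
gen ratio = Sales / sale_s

* Approach 2: pc() with the prop option does it in one step
by industry: egen ratio2 = pc(Sales), prop

* Both approaches should agree up to float rounding
assert reldif(ratio, ratio2) < 1e-5

list company industry Sales ratio ratio2, sepby(industry)
```

Within each industry the shares should add up to 1, which is a quick sanity check on the result.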

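About pc()

pc(exp) [, prop]                                                      (allows by varlist)
  returns exp (within varlist) scaled to be a percentage of the total, between 0 and 100. The prop option returns exp scaled to be a proportion of the total, between 0 and 1.

This is the description of pc() among the egen functions in Stata 11. pc() returns each value's share of the total. With the prop option, the result is a proportion between 0 and 1 instead of a percentage. The function can also be combined with by, which makes it considerably more useful.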
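
A small sketch of the difference between pc() with and without prop, on made-up data (g, x, pct, share, and pct_sum are hypothetical names chosen for illustration):

```stata
* Hypothetical toy data: two groups
clear
input byte g double x
1 1
1 3
2 2
2 2
2 4
end

bysort g: egen pct   = pc(x)          // percentage of the group total, 0 to 100
bysort g: egen share = pc(x), prop    // proportion of the group total, 0 to 1

* Percentages add up to 100 within each group
bysort g: egen pct_sum = total(pct)
assert abs(pct_sum - 100) < 1e-4

* The two scalings differ only by a factor of 100
assert abs(pct - 100*share) < 1e-4

list, sepby(g)
```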

About rowtotal():

generate's sum() function creates the vertical, running sum of its argument, whereas egen's total() function creates a constant equal to the overall sum. egen's rowtotal() function, however, creates the horizontal sum of its arguments. They all treat missing as zero. However, if the missing option is specified with total() or rowtotal(), then newvar will contain missing values if all values of exp or varlist are missing.

According to this passage, sum() produces a vertical running sum, total() produces the overall sum, and rowtotal() sums horizontally across variables. All three treat missing values as 0.

Example:
. do "C:\DOCUME~1\ADMINI~1\LOCALS~1\Temp\STD00000000.tmp"

. webuse egenxmpl4,clear

. egen hsum=rowtotal(a b c)

. generate vsum=sum(hsum)

. egen sum=total(hsum)

. list

     +----------------------------------+
     |  a    b    c   hsum   vsum   sum |
     |----------------------------------|
  1. |  .    2    3      5      5    63 |
  2. |  4    .    6     10     15    63 |
  3. |  7    8    .     15     30    63 |
  4. | 10   11   12     33     63    63 |
     +----------------------------------+

end of do-file

This example clearly illustrates the differences among sum(), total(), and rowtotal().
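
The missing option mentioned in the manual excerpt is not exercised by the example above. The sketch below, on made-up data (a, b, c, d and the result names are hypothetical), shows the case the excerpt describes, where every input is missing:

```stata
* Hypothetical toy data: the third row is missing in every variable
clear
input a b c
. 2 3
4 . 6
. . .
end

egen h0 = rowtotal(a b c)             // missing treated as 0, so row 3 gives 0
egen h1 = rowtotal(a b c), missing    // row 3 gives missing: all inputs are missing

assert h0[3] == 0
assert missing(h1[3])
assert h1[1] == 5 & h1[2] == 10       // rows with some data are unaffected

* The same idea for total(): d is missing in every observation
gen d = .
egen t0 = total(d)                    // 0
egen t1 = total(d), missing           // missing
assert t0 == 0
assert missing(t1)

list
```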


1.2 pctile() rank() group()

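These three functions are grouped together because they all deal with ordering. pctile() gives percentiles and rank() gives each observation's position in the ordered sequence, so both are "ordered" in the usual sense. group() is different: it does not rank values in ascending or descending order, but numbers the distinct groups of a varlist with consecutive integers.

pctile()

pctile(exp) [, p(#)]                                                         (allows by varlist)
creates a constant (within varlist) containing the #th percentile of exp. If # is not specified, 50 is assumed, meaning medians. Also see median().

This is the description of pctile(). It computes the value of the requested percentile of a variable and stores it as a constant (within each by-group, if by is used). The default is the median.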
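
A short sketch of pctile() using the auto dataset that ships with Stata. The new variable names (med, med2, p90, med_f) are arbitrary choices for illustration.

```stata
sysuse auto, clear

egen med  = pctile(mpg)             // p() omitted, so this is the 50th percentile
egen med2 = median(mpg)             // egen's median() for comparison
egen p90  = pctile(mpg), p(90)      // 90th percentile

* Median of mpg computed separately for domestic and foreign cars
bysort foreign: egen med_f = pctile(mpg)

* The default of pctile() and median() should coincide
assert med == med2
assert p90 >= med

tabstat med_f, by(foreign)
```

Each result is a constant: every observation (or every observation in a by-group) carries the same value, which is convenient for flagging observations above or below a percentile.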

rank()
rank(exp) [, field|track|unique]                                             (allows by varlist)
  creates ranks (within varlist) of exp; by default, equal observations are assigned the average rank. The field option calculates the field rank of exp; the highest value is ranked 1, and there is no correction for ties. That is, the field rank is 1 + the number of values that are higher. The track option calculates the track rank of exp; the lowest value is ranked 1, and there is no correction for ties. That is, the track rank is 1 + the number of values that are lower. The unique option calculates the unique rank of exp; values are ranked 1,…,#, and ties are broken arbitrarily. Two values that are tied for second are ranked 2 and 3.

Example 7: rank()
Most applications of rank() will be to one variable, but the argument exp can be more general, namely, an expression. In particular, rank(-varname) reverses ranks from those obtained by rank(varname).
The default ranking and those obtained by using one of the track, field, and unique options differ principally in their treatment of ties. The default is to assign the same rank to tied values such that the sum of the ranks is preserved. The track option assigns the same rank but resembles the convention in track events; thus, if one person had the lowest time and three persons tied for second-lowest time, their ranks would be 1, 2, 2, and 2, and the next person(s) would have rank 5. The field option acts similarly except that the highest is assigned rank 1, as in field events in which the greatest distance or height wins. The unique option breaks ties arbitrarily: its most obvious use is assigning ranks for a graph of ordered values. See also group() for another kind of "ranking".

Run the following code:
sysuse auto,clear
keep in 1/10
egen rank=rank(mpg)
egen rank_r=rank(-mpg)
egen rank_f=rank(mpg),field
egen rank_t=rank(mpg),track
egen rank_u=rank(mpg),unique
egen rank_ur=rank(-mpg),unique
sort rank_u
list mpg rank*

The output is:
. do "C:\DOCUME~1\ADMINI~1\LOCALS~1\Temp\STD00000000.tmp"

. sysuse auto,clear
(1978 Automobile Data)

. keep in 1/10
(64 observations deleted)

. egen rank=rank(mpg)

. egen rank_r=rank(-mpg)

. egen rank_f=rank(mpg),field

. egen rank_t=rank(mpg),track

. egen rank_u=rank(mpg),unique

. egen rank_ur=rank(-mpg),unique

. sort rank_u

. list mpg rank*

     +----------------------------------------------------------+
     | mpg   rank   rank_r   rank_f   rank_t   rank_u   rank_ur |
     |----------------------------------------------------------|
  1. |  15      1       10       10        1        1        10 |
  2. |  16      2        9        9        2        2         9 |
  3. |  17      3        8        8        3        3         8 |
  4. |  18      4        7        7        4        4         7 |
  5. |  19      5        6        6        5        5         6 |
     |----------------------------------------------------------|
  6. |  20    6.5      4.5        4        6        6         4 |
  7. |  20    6.5      4.5        4        6        7         5 |
  8. |  22    8.5      2.5        2        8        8         3 |
  9. |  22    8.5      2.5        2        8        9         2 |
 10. |  26     10        1        1       10       10         1 |
     +----------------------------------------------------------+

end of do-file

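From the description above and this example, two points about rank() are worth noting:
1. rank(-var) gives the reverse of rank(var).
2. rank() has three options: track, field, and unique. These options and the default differ in how they handle equal values (ties).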
The default is to assign the same rank to tied values such that the sum of the ranks is preserved. The track option assigns the same rank but resembles the convention in track events; thus, if one person had the lowest time and three persons tied for second-lowest time, their ranks would be 1, 2, 2, and 2, and the next person(s) would have rank 5. The field option acts similarly except that the highest is assigned rank 1, as in field events in which the greatest distance or height wins. The unique option breaks ties arbitrarily: its most obvious use is assigning ranks for a graph of ordered values.
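
The track-event case in the passage above (one lowest time, three tied for second, then one more) can be reproduced directly. This is a sketch on made-up times; the variable names are hypothetical.

```stata
* Hypothetical race times: one fastest, three tied for second, one slowest
clear
input time
10
11
11
11
12
end

egen r_default = rank(time)           // ties get the average rank: 1, 3, 3, 3, 5
egen r_track   = rank(time), track    // 1 + number of lower values: 1, 2, 2, 2, 5
egen r_field   = rank(time), field    // 1 + number of higher values: 5, 2, 2, 2, 1

* The default preserves the sum of the ranks (1 + 2 + 3 + 4 + 5 = 15)
egen rank_sum = total(r_default)
assert rank_sum == 15

list
```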

group()

group() maps the distinct groups of a varlist to a categorical variable that takes on integer values from 1 to the total number of groups. The order of the groups is that of the sort order of varlist. The varlist may be of numeric variables, string variables, or a mixture of the two. The resulting variable can be useful for many purposes, including stepping through the distinct groups easily and systematically and cleaning up an untidy ordering. Suppose that the actual (and arbitrary) codes present in the data are 1, 2, 4, and 7, but we desire equally spaced numbers, as when the codes will be values on one axis of a graph. group() maps these to 1, 2, 3, and 4.
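
The 1, 2, 4, 7 case from the passage can be sketched as follows. The data are made up, and code and id are hypothetical variable names.

```stata
* Hypothetical data with untidy codes 1, 2, 4, 7 in arbitrary order
clear
input code
7
1
4
2
4
1
end

* Map the distinct codes to consecutive integers 1, 2, 3, 4
egen id = group(code)

assert id == 1 if code == 1
assert id == 2 if code == 2
assert id == 3 if code == 4
assert id == 4 if code == 7

list
```

Because group() accepts a varlist, the same call with several variables, for example group(industry year) on a panel, would number each distinct combination.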

