Compare two groups based on a categorical variable in R
I have created df, which contains more than 8,000 firm years.
gvkey = company ID
fam = dummy (equals 1 if the firm is a family firm)
industry = categorical variable
gvkey fam industry
1 1004 0 6
2 1004 0 6
3 1004 0 6
4 1004 0 6
5 1004 0 6
6 1013 0 4
7 1013 0 4
8 1013 0 4
9 1013 0 4
10 1013 0 4
11 1013 0 4
12 1045 0 5
13 1045 0 5
14 1045 0 5
15 1045 0 5
16 1045 0 5
17 1045 0 5
18 1072 0 4
19 1072 0 4
20 1072 0 4
21 1072 0 4
22 1072 0 4
23 1076 1 9
24 1076 1 9
25 1076 1 9
26 1076 1 9
27 1076 1 9
28 1076 1 9
29 1078 0 4
30 1078 0 4
31 1078 0 4
32 1078 0 4
33 1078 0 4
34 1078 0 4
35 1121 1 6
36 1121 1 6
37 1121 1 6
38 1121 1 6
39 1121 1 6
40 1121 1 6
41 1161 0 4
42 1161 0 4
43 1161 0 4
44 1161 0 4
45 1161 0 4
46 1161 0 4
47 1209 0 4
48 1209 0 4
49 1209 0 4
50 1209 0 4
...
This is roughly how the output should look. Industry description = industry
Verbal logic:
1) For all unique gvkey, create a column that counts the number of fam = 0 in each industry.
2) For all unique gvkey, create a column that counts the number of fam = 1 in each industry.
3) Create an output that shows the frequencies of family firms and non-family firms for each industry.
Maybe it is even possible to do this in a single piece of code?
Thank you so much!
Your verbal logic is not very clear to me (particularly the statements regarding unique gvkey for the final output), but here I provide two results so you can see which one is the one you want:
Using unique(df) for the count (one row per firm):
# sum(x == 0) counts non-family rows, sum(x == 1) counts family rows
dfout <- `colnames<-`(data.frame(as.matrix(aggregate(fam ~ industry,
  unique(df),
  FUN = function(x) c(sum(x == 0), sum(x == 1), sum(x == 1) / length(x) * 100)))),
  c("Industry", "NoFamCnt", "FamCnt", "FamPerc"))
such that
> dfout
  Industry NoFamCnt FamCnt FamPerc
1        4        5      0       0
2        5        1      0       0
3        6        1      1      50
4        9        0      1     100
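Counting the rows of unique(df) as firms relies on fam and industry staying constant within each gvkey; a firm whose status or industry changes between years would show up more than once. A minimal sketch of that check, assuming the column names from the question. The helper name varies_within is hypothetical, and the data frame is a compact rebuild of the 50-row sample:

```r
# Compact rebuild of the 50-row sample: one entry per firm, repeated per firm-year
rows <- c(5, 6, 6, 5, 6, 6, 6, 6, 4)
df <- data.frame(
  gvkey    = rep(c(1004L, 1013L, 1045L, 1072L, 1076L, 1078L, 1121L, 1161L, 1209L), rows),
  fam      = rep(c(0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L), rows),
  industry = rep(c(6L, 4L, 5L, 4L, 9L, 4L, 6L, 4L, 4L), rows)
)

# Hypothetical helper: TRUE for each gvkey whose values of `col` are not all equal
varies_within <- function(data, col) {
  tapply(data[[col]], data$gvkey, function(v) length(unique(v)) > 1)
}

fam_varies      <- varies_within(df, "fam")
industry_varies <- varies_within(df, "industry")

# Firms that unique(df) would count more than once
problem_firms <- names(which(fam_varies | industry_varies))
length(problem_firms)
```

If this length is not zero on the full data, it is probably safer to decide first which year's fam and industry value should represent each firm.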
Using df for the count (one row per firm year):
# sum(x == 0) counts non-family rows, sum(x == 1) counts family rows
dfout <- `colnames<-`(data.frame(as.matrix(aggregate(fam ~ industry,
  df,
  FUN = function(x) c(sum(x == 0), sum(x == 1), sum(x == 1) / length(x) * 100)))),
  c("Industry", "NoFamCnt", "FamCnt", "FamPerc"))
such that
> dfout
  Industry NoFamCnt FamCnt   FamPerc
1        4       27      0   0.00000
2        5        6      0   0.00000
3        6        5      6  54.54545
4        9        0      6 100.00000
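For comparison, both variants can also be written with table(), which makes the firm versus firm-year distinction an explicit switch. This is only a sketch, assuming the column names from the question and integer industry codes; the function name fam_by_industry and its by_firm argument are made up for illustration:

```r
# Compact rebuild of the 50-row sample: one entry per firm, repeated per firm-year
rows <- c(5, 6, 6, 5, 6, 6, 6, 6, 4)
df <- data.frame(
  gvkey    = rep(c(1004L, 1013L, 1045L, 1072L, 1076L, 1078L, 1121L, 1161L, 1209L), rows),
  fam      = rep(c(0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L), rows),
  industry = rep(c(6L, 4L, 5L, 4L, 9L, 4L, 6L, 4L, 4L), rows)
)

# Hypothetical helper, not part of any package
fam_by_industry <- function(data, by_firm = TRUE) {
  # by_firm = TRUE keeps one row per firm; FALSE counts every firm-year
  if (by_firm) data <- unique(data[, c("gvkey", "fam", "industry")])
  # Fix both fam levels so the "0" and "1" columns always exist
  counts <- table(data$industry, factor(data$fam, levels = c(0, 1)))
  out <- data.frame(
    Industry = as.integer(rownames(counts)),  # assumes integer industry codes
    NoFamCnt = as.vector(counts[, "0"]),
    FamCnt   = as.vector(counts[, "1"])
  )
  out$FamPerc <- 100 * out$FamCnt / (out$NoFamCnt + out$FamCnt)
  out
}

fam_by_industry(df)                   # counts firms
fam_by_industry(df, by_firm = FALSE)  # counts firm-years
```

On the sample data the two calls should give the same counts and percentages as the two aggregate() versions.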
One dplyr option could be:
library(dplyr)

df %>%
  group_by(industry) %>%
  summarise(n_family = n_distinct(gvkey[fam == 1]),
            n_no_family = n_distinct(gvkey[fam == 0]),
            perc_family = n_family / n_distinct(gvkey) * 100)
industry n_family n_no_family perc_family
<int> <int> <int> <dbl>
1 4 0 5 0
2 5 0 1 0
3 6 1 1 50
4 9 1 0 100
Base R solution (note: it isn't typically good practice to keep spaces in vector names)
# Reshape / Rename the input data:
ir_df <- setNames(reshape(setNames(aggregate(.~fam+industry, df, length),
c("fam", "industry", "count")),
direction = "wide",
idvar = "industry",
timevar = "fam"), c("Industry", "Nonfamily Firms", "Family Firms"))
# Transform the data frame to contain the final equation:
final_df <- transform(replace(ir_df, is.na(ir_df), 0),
`Percent Family Firms In Industry` =
round(`Family Firms` /
rowSums(ir_df[,grepl("family", tolower(names(ir_df)))], na.rm = TRUE)
* 100, 2))
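To illustrate the remark about spaces in names: a column name containing a space is not a syntactic R name, so it has to be quoted with backticks every time it is referenced, and make.names() shows the syntactic form R would generate instead. A small sketch with arbitrary values, not taken from the data above:

```r
# check.names = FALSE keeps the space in the column name
x <- data.frame(`Family Firms` = c(1L, 6L), check.names = FALSE)
names(x)

# Backticks are required to reference a non-syntactic name
x$`Family Firms`

# make.names() returns the syntactic version of each name
syntactic <- make.names(names(x))
syntactic

# After renaming, the column can be referenced without quoting
names(x) <- syntactic
x$Family.Firms
```

Keeping syntactic names while computing and adding display labels only when printing or exporting tends to avoid this quoting.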
Data:
df <- structure(list(gvkey = c(1004L, 1004L, 1004L, 1004L, 1004L, 1013L,
1013L, 1013L, 1013L, 1013L, 1013L, 1045L, 1045L, 1045L, 1045L,
1045L, 1045L, 1072L, 1072L, 1072L, 1072L, 1072L, 1076L, 1076L,
1076L, 1076L, 1076L, 1076L, 1078L, 1078L, 1078L, 1078L, 1078L,
1078L, 1121L, 1121L, 1121L, 1121L, 1121L, 1121L, 1161L, 1161L,
1161L, 1161L, 1161L, 1161L, 1209L, 1209L, 1209L, 1209L), fam = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L,
0L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L), industry = c(6L, 6L, 6L, 6L, 6L, 4L, 4L, 4L, 4L, 4L, 4L,
5L, 5L, 5L, 5L, 5L, 5L, 4L, 4L, 4L, 4L, 4L, 9L, 9L, 9L, 9L, 9L,
9L, 4L, 4L, 4L, 4L, 4L, 4L, 6L, 6L, 6L, 6L, 6L, 6L, 4L, 4L, 4L,
4L, 4L, 4L, 4L, 4L, 4L, 4L)), class = "data.frame", row.names = c(NA,
-50L))