
Compare two groups based on a categorical variable in R

I have created a data frame, df, which contains more than 8,000 firm-years.

gvkey = company ID

fam = dummy (equals 1 if the firm is a family firm)

industry = categorical variable

   gvkey   fam  industry
1   1004    0     6
2   1004    0     6
3   1004    0     6
4   1004    0     6
5   1004    0     6
6   1013    0     4
7   1013    0     4
8   1013    0     4
9   1013    0     4
10  1013    0     4
11  1013    0     4
12  1045    0     5
13  1045    0     5
14  1045    0     5
15  1045    0     5
16  1045    0     5
17  1045    0     5
18  1072    0     4
19  1072    0     4
20  1072    0     4
21  1072    0     4
22  1072    0     4
23  1076    1     9
24  1076    1     9
25  1076    1     9
26  1076    1     9
27  1076    1     9
28  1076    1     9
29  1078    0     4
30  1078    0     4
31  1078    0     4
32  1078    0     4
33  1078    0     4
34  1078    0     4
35  1121    1     6
36  1121    1     6
37  1121    1     6
38  1121    1     6
39  1121    1     6
40  1121    1     6
41  1161    0     4
42  1161    0     4
43  1161    0     4
44  1161    0     4
45  1161    0     4
46  1161    0     4
47  1209    0     4
48  1209    0     4
49  1209    0     4
50  1209    0     4
...

This is roughly how the final output for my thesis should look. The column "Industry description" corresponds to my column industry.

Verbal logic:

1) For all unique gvkey, create a column which counts the number of firms with fam = 0 in each industry.

2) For all unique gvkey, create a column which counts the number of firms with fam = 1 in each industry.

3) Create an output which shows the frequencies of family firms and non-family firms for each industry.

Maybe it is even possible to do this in a single piece of code?

Thank you so much!

Your verbal logic is not very clear to me (particularly the statements regarding unique gvkey for the final output), so here are two results; you can pick the one that matches what you want:

  • Result 1: using unique(df) for the count
# sum(x == 0) counts non-family rows, sum(x == 1) counts family rows
dfout <- `colnames<-`(data.frame(as.matrix(aggregate(fam ~industry,
                                                     unique(df),
                                                     FUN = function(x) c(sum(x==0),sum(x==1),sum(x==1)/length(x)*100)))), 
                      c("Industry", "NoFamCnt", "FamCnt", "FamPerc"))

such that

> dfout
  Industry NoFamCnt FamCnt FamPerc
1        4        5      0       0
2        5        1      0       0
3        6        1      1      50
4        9        0      1     100
  • Result 2: using df for the count
# sum(x == 0) counts non-family rows, sum(x == 1) counts family rows
dfout <- `colnames<-`(data.frame(as.matrix(aggregate(fam ~industry,
                                                     df,
                                                     FUN = function(x) c(sum(x==0),sum(x==1),sum(x==1)/length(x)*100)))), 
                      c("Industry", "NoFamCnt", "FamCnt", "FamPerc"))

such that

> dfout
  Industry NoFamCnt FamCnt   FamPerc
1        4       27      0   0.00000
2        5        6      0   0.00000
3        6        5      6  54.54545
4        9        0      6 100.00000
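
The two results differ only in the unit being counted: distinct rows (in the sample data, one per firm) versus firm-years. As a minimal sketch of that difference, assuming a small made-up data frame (toy is hypothetical and not the asker's data), the same contrast can be seen with table():

```r
# Hypothetical toy data: three firms, fam and industry constant within each firm
toy <- data.frame(
  gvkey    = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
  fam      = c(0, 0, 0, 1, 1, 0, 0, 0, 0),
  industry = c(4, 4, 4, 4, 4, 6, 6, 6, 6)
)

# Firm-years: every row of toy is counted
firm_years <- table(industry = toy$industry, fam = toy$fam)

# Firms: collapse duplicate rows first, so each firm is counted once
firms_only <- unique(toy)
firms <- table(industry = firms_only$industry, fam = firms_only$fam)

print(firm_years)
print(firms)
```

Which of the two is appropriate depends on whether the thesis table is meant to report firms or firm-year observations.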

One dplyr option could be:

library(dplyr)

df %>%
 group_by(industry) %>%
 summarise(n_family = n_distinct(gvkey[fam == 1]),
           n_no_family = n_distinct(gvkey[fam == 0]),
           perc_family = n_family/n_distinct(gvkey)*100) 

  industry n_family n_no_family perc_family
     <int>    <int>       <int>       <dbl>
1        4        0           5           0
2        5        0           1           0
3        6        1           1          50
4        9        1           0         100
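
One point that may be worth checking on the full data: counting distinct gvkey values separately for fam == 1 and fam == 0 puts a firm in both columns if its fam value ever changes between years. The sketch below uses base R and a made-up data frame (toy is hypothetical, and whether any firm in the real data switches status is an assumption) to show the effect:

```r
# Hypothetical toy data: firm 2 switches from non-family (0) to family (1)
toy <- data.frame(
  gvkey    = c(1, 1, 2, 2, 2),
  fam      = c(0, 0, 0, 1, 1),
  industry = c(4, 4, 4, 4, 4)
)

# Base R counterparts of n_distinct(gvkey[fam == 1]) and n_distinct(gvkey[fam == 0])
n_family    <- length(unique(toy$gvkey[toy$fam == 1]))
n_no_family <- length(unique(toy$gvkey[toy$fam == 0]))
n_firms     <- length(unique(toy$gvkey))

# Firm 2 appears in both groups, so the two counts add up to more than n_firms
print(c(n_family = n_family, n_no_family = n_no_family, n_firms = n_firms))
```

If fam is constant within every firm, as in the sample shown in the question, the two counts add up to the number of distinct firms and the issue does not arise.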

Base R solution (note: it isn't typically good practice to keep spaces in vector names)

# Reshape / rename the input data (counts are rows, i.e. firm-years):
ir_df <- setNames(reshape(setNames(aggregate(.~fam+industry, df, length),
                                   c("fam", "industry", "count")),
                          direction = "wide",
                          idvar = "industry",
                          timevar = "fam"),
                  c("Industry", "Nonfamily Firms", "Family Firms"))

# Replace missing counts with 0 and add the percentage of family firms per industry:
final_df <- transform(replace(ir_df, is.na(ir_df), 0),
                      `Percent Family Firms In Industry` =
                        round(`Family Firms` /
                                rowSums(ir_df[,grepl("family", tolower(names(ir_df)))], na.rm = TRUE)
                              * 100, 2))

Data:

df <- structure(list(gvkey = c(1004L, 1004L, 1004L, 1004L, 1004L, 1013L, 
1013L, 1013L, 1013L, 1013L, 1013L, 1045L, 1045L, 1045L, 1045L, 
1045L, 1045L, 1072L, 1072L, 1072L, 1072L, 1072L, 1076L, 1076L, 
1076L, 1076L, 1076L, 1076L, 1078L, 1078L, 1078L, 1078L, 1078L, 
1078L, 1121L, 1121L, 1121L, 1121L, 1121L, 1121L, 1161L, 1161L, 
1161L, 1161L, 1161L, 1161L, 1209L, 1209L, 1209L, 1209L), fam = c(0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 
0L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L), industry = c(6L, 6L, 6L, 6L, 6L, 4L, 4L, 4L, 4L, 4L, 4L, 
5L, 5L, 5L, 5L, 5L, 5L, 4L, 4L, 4L, 4L, 4L, 9L, 9L, 9L, 9L, 9L, 
9L, 4L, 4L, 4L, 4L, 4L, 4L, 6L, 6L, 6L, 6L, 6L, 6L, 4L, 4L, 4L, 
4L, 4L, 4L, 4L, 4L, 4L, 4L)), class = "data.frame", row.names = c(NA, 
-50L))
