Comparing two groups according to a categorical variable in R

Question

I have created a data frame, df, which contains more than 8,000 firm-years.

gvkey = company ID

fam = dummy (equals 1 if the firm is a family firm)

industry = categorical variable

   gvkey   fam  industry
1   1004    0     6
2   1004    0     6
3   1004    0     6
4   1004    0     6
5   1004    0     6
6   1013    0     4
7   1013    0     4
8   1013    0     4
9   1013    0     4
10  1013    0     4
11  1013    0     4
12  1045    0     5
13  1045    0     5
14  1045    0     5
15  1045    0     5
16  1045    0     5
17  1045    0     5
18  1072    0     4
19  1072    0     4
20  1072    0     4
21  1072    0     4
22  1072    0     4
23  1076    1     9
24  1076    1     9
25  1076    1     9
26  1076    1     9
27  1076    1     9
28  1076    1     9
29  1078    0     4
30  1078    0     4
31  1078    0     4
32  1078    0     4
33  1078    0     4
34  1078    0     4
35  1121    1     6
36  1121    1     6
37  1121    1     6
38  1121    1     6
39  1121    1     6
40  1121    1     6
41  1161    0     4
42  1161    0     4
43  1161    0     4
44  1161    0     4
45  1161    0     4
46  1161    0     4
47  1209    0     4
48  1209    0     4
49  1209    0     4
50  1209    0     4
...

This is roughly what the output should look like (Industry description = industry).

Verbal logic:

1) For all unique gvkey, create a column which counts the number of fam = 0 in each industry.

2) For all unique gvkey, create a column which counts the number of fam = 1 in each industry.

3) Create an output which shows the frequencies of family firms and non-family firms for each industry.
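Read literally, steps 1 to 3 can be sketched in base R as follows. This is only one interpretation, assuming each firm should be counted once per industry. The names firms, out, n_nonfam and n_fam are made up for illustration, and the sample data is rebuilt from the 50 rows shown above.

```r
# Rebuild the 50 sample rows: one entry per firm, repeated once per firm-year.
n_years <- c(5, 6, 6, 5, 6, 6, 6, 6, 4)
df <- data.frame(
  gvkey    = rep(c(1004L, 1013L, 1045L, 1072L, 1076L, 1078L, 1121L, 1161L, 1209L), n_years),
  fam      = rep(c(0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L), n_years),
  industry = rep(c(6L, 4L, 5L, 4L, 9L, 4L, 6L, 4L, 4L), n_years)
)

# Keep one row per firm so that firm-years are not counted repeatedly.
firms <- unique(df[, c("gvkey", "fam", "industry")])

# Steps 1 and 2: for every firm, attach the number of non-family and
# family firms found in that firm's industry.
firms$n_nonfam <- ave(as.integer(firms$fam == 0), firms$industry, FUN = sum)
firms$n_fam    <- ave(as.integer(firms$fam == 1), firms$industry, FUN = sum)

# Step 3: collapse to one row per industry.
out <- unique(firms[, c("industry", "n_nonfam", "n_fam")])
out <- out[order(out$industry), ]
print(out)
```

Keeping the counts as columns on firms mirrors the wording of steps 1 and 2. If only the final table is needed, the intermediate columns can be skipped.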

Maybe it is even possible to do this in a single piece of code?

Thank you so much!

Answer 1

Your verbal logic is not very clear to me (particularly the statements about unique gvkey for the final output), so here are two results and you can see which one you want:

Result 1: using unique(df) for the count (duplicate rows are dropped, so each firm in the sample is counted once)

# sum(x == 0) counts non-family rows and sum(x == 1) counts family rows,
# so the column names are listed in that order.
dfout <- `colnames<-`(data.frame(as.matrix(aggregate(fam ~ industry,
                                                     unique(df),
                                                     FUN = function(x) c(sum(x == 0), sum(x == 1), sum(x == 1) / length(x) * 100)))),
                      c("Industry", "NoFamCnt", "FamCnt", "FamPerc"))

which gives

> dfout
  Industry NoFamCnt FamCnt FamPerc
1        4        5      0       0
2        5        1      0       0
3        6        1      1      50
4        9        0      1     100

Result 2: using df for the count (every firm-year is counted)

# Same aggregation as above, applied to all rows of df.
dfout <- `colnames<-`(data.frame(as.matrix(aggregate(fam ~ industry,
                                                     df,
                                                     FUN = function(x) c(sum(x == 0), sum(x == 1), sum(x == 1) / length(x) * 100)))),
                      c("Industry", "NoFamCnt", "FamCnt", "FamPerc"))

which gives

> dfout
  Industry NoFamCnt FamCnt   FamPerc
1        4       27      0   0.00000
2        5        6      0   0.00000
3        6        5      6  54.54545
4        9        0      6 100.00000
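The two results differ only in the unit being counted: firms in result 1, firm-years in result 2. A compact way to see both side by side is a cross-tabulation with table(). This is a sketch rather than a replacement for the code above; the object names firms, by_firm, by_firm_year and perc_family are illustrative, and the sample data is rebuilt so the block runs on its own.

```r
# Rebuild the 50 sample rows: one entry per firm, repeated once per firm-year.
n_years <- c(5, 6, 6, 5, 6, 6, 6, 6, 4)
df <- data.frame(
  gvkey    = rep(c(1004L, 1013L, 1045L, 1072L, 1076L, 1078L, 1121L, 1161L, 1209L), n_years),
  fam      = rep(c(0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L), n_years),
  industry = rep(c(6L, 4L, 5L, 4L, 9L, 4L, 6L, 4L, 4L), n_years)
)

# One row per firm (what unique(df) yields for this sample).
firms <- unique(df)

# Industry-by-family-status counts, per firm and per firm-year.
by_firm      <- table(industry = firms$industry, fam = firms$fam)
by_firm_year <- table(industry = df$industry, fam = df$fam)

# Share of family firms within each industry, in percent.
perc_family <- prop.table(by_firm, margin = 1)[, "1"] * 100

print(by_firm)
print(by_firm_year)
print(perc_family)
```

Here the columns are labelled "0" and "1" after the values of fam, which avoids any ambiguity about which count belongs to which group.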

Answer 2

One dplyr option could be:

library(dplyr)

df %>%
 group_by(industry) %>%
 summarise(n_family = n_distinct(gvkey[fam == 1]),
           n_no_family = n_distinct(gvkey[fam == 0]),
           perc_family = n_family/n_distinct(gvkey)*100) 

  industry n_family n_no_family perc_family
     <int>    <int>       <int>       <dbl>
1        4        0           5           0
2        5        0           1           0
3        6        1           1          50
4        9        1           0         100
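Counting distinct gvkey values per group works cleanly when a firm's fam and industry values never change over time. If a firm switched status, it would appear in both n_family and n_no_family for its industry. A quick check along these lines (a base R sketch; n_fam_values, n_ind_values and switchers are illustrative names) confirms that the sample data has no such firms:

```r
# Rebuild the 50 sample rows: one entry per firm, repeated once per firm-year.
n_years <- c(5, 6, 6, 5, 6, 6, 6, 6, 4)
df <- data.frame(
  gvkey    = rep(c(1004L, 1013L, 1045L, 1072L, 1076L, 1078L, 1121L, 1161L, 1209L), n_years),
  fam      = rep(c(0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L), n_years),
  industry = rep(c(6L, 4L, 5L, 4L, 9L, 4L, 6L, 4L, 4L), n_years)
)

# Number of distinct fam and industry values each firm takes across its years.
n_fam_values <- tapply(df$fam, df$gvkey, function(x) length(unique(x)))
n_ind_values <- tapply(df$industry, df$gvkey, function(x) length(unique(x)))

# Firms whose family status or industry changes at least once.
switchers <- names(n_fam_values)[n_fam_values > 1 | n_ind_values > 1]
print(switchers)
```

On the full data set of 8,000+ firm-years, a non-empty switchers vector would mean the per-industry counts need a rule for which year's value to use.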

Answer 3

Base R solution (note: it isn't typically good practice to keep spaces in vector names)

# Reshape / rename the input data:
ir_df <- setNames(reshape(setNames(aggregate(. ~ fam + industry, df, length),
                                   c("fam", "industry", "count")),
                          direction = "wide",
                          idvar = "industry",
                          timevar = "fam"),
                  c("Industry", "Nonfamily Firms", "Family Firms"))

# Transform the data frame to contain the final equation:
final_df <- transform(replace(ir_df, is.na(ir_df), 0),
                      `Percent Family Firms In Industry` =
                        round(`Family Firms` /
                                rowSums(ir_df[, grepl("family", tolower(names(ir_df)))], na.rm = TRUE)
                              * 100, 2))

Data:

df <- structure(list(gvkey = c(1004L, 1004L, 1004L, 1004L, 1004L, 1013L, 
1013L, 1013L, 1013L, 1013L, 1013L, 1045L, 1045L, 1045L, 1045L, 
1045L, 1045L, 1072L, 1072L, 1072L, 1072L, 1072L, 1076L, 1076L, 
1076L, 1076L, 1076L, 1076L, 1078L, 1078L, 1078L, 1078L, 1078L, 
1078L, 1121L, 1121L, 1121L, 1121L, 1121L, 1121L, 1161L, 1161L, 
1161L, 1161L, 1161L, 1161L, 1209L, 1209L, 1209L, 1209L), fam = c(0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 
0L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L), industry = c(6L, 6L, 6L, 6L, 6L, 4L, 4L, 4L, 4L, 4L, 4L, 
5L, 5L, 5L, 5L, 5L, 5L, 4L, 4L, 4L, 4L, 4L, 9L, 9L, 9L, 9L, 9L, 
9L, 4L, 4L, 4L, 4L, 4L, 4L, 6L, 6L, 6L, 6L, 6L, 6L, 4L, 4L, 4L, 
4L, 4L, 4L, 4L, 4L, 4L, 4L)), class = "data.frame", row.names = c(NA, 
-50L))


Answer 1
2 2019-12-18 11:04:17

Answer 2
1 accepted 2019-12-18 10:34:12

Answer 3
0 2019-12-18 11:07:24
