Creating a new variable using dplyr and ifelse where more than one condition is used

I have the following data set. I want to create a dummy variable called "specialized". To create it, I need to group the data with group_by(sic, year). The variable is then defined as follows:

If, within a given "year" and "sic", a row's "percentage" is the highest and the difference between the highest and the second-highest percentage is greater than 10, it is coded 1; otherwise it is coded 0.

However, note that if a given "year" and "sic" has no second-highest percentage (that is, there is only one percentage, which is therefore the highest), it is coded 1. In my data set, this is the case for sic == 0100 in year == 2000.

I tried the following code:

df <- df %>% 
  group_by(sic, year) %>% 
  mutate(SPECIALIZED = ifelse(max(percentage) && (max(percentage)-nth(sort(percentage), - 2)) > 10), 1, 0 ) %>% 
  ungroup()

But it does not work.

Here is the data:

   gvkey auditor_fkey  year  sic  percentage
1  001266            4 2001 0100  26.9605909
2  003107            2 2000 1000  37.0939127
3  003107            2 2000 1000  37.0939127
4  003107            2 2001 1000   9.8899690
5  003107            2 2001 1000   9.8899690
6  005560            1 2000 1040 100.0000000
7  005560            7 2001 1040   8.2959428
8  007881            5 2001 1040  71.1026743
9  009728          597 2001 1040   1.0906007
10 009728          597 2001 1040   1.0906007
11 010390            2 2000 0100 100.0000000
12 010390            2 2000 0100 100.0000000
13 010390            2 2001 0100  73.0394091
14 010390            2 2001 0100  73.0394091
15 012321            1 2001 1040  18.1873703
16 012321            1 2001 1040  18.1873703
17 014590            5 2000 1000  60.6862904
18 014590            5 2000 1000  60.6862904
19 014590            5 2001 1000  18.8287898
20 014590            5 2001 1000  18.8287898
21 014793            2 2000 1220  34.7515455
22 014793            2 2000 1220  34.7515455
23 014793            2 2001 1220  58.0859392
24 014793            2 2001 1220  58.0859392
25 015274            1 2000 1220  65.2484545
26 015274            1 2000 1220  65.2484545
27 015274            1 2001 1220  41.9140608
28 015274            1 2001 1220  41.9140608
29 019565            1 2001 1000  71.1457384
30 019565            1 2001 1000  71.1457384
31 020488            1 2000 1040 100.0000000
32 020488            1 2001 1040  18.1873703
33 025776            1 2000 1000   2.2197969
34 025776            1 2001 1000  71.1457384
35 031626            2 2000 1000  37.0939127
36 031626            2 2001 1000   9.8899690
37 061811            5 2000 1000  60.6862904
38 061811            5 2001 1000  18.8287898
39 061811            5 2001 1000  18.8287898
40 064134          580 2001 1000   0.1355028
41 064134          580 2001 1000   0.1355028
42 065921            1 2000 1040 100.0000000
43 065921            1 2000 1040 100.0000000
44 065921            1 2001 1040  18.1873703
45 065921            1 2001 1040  18.1873703
46 102341            2 2001 1040   1.3234119
47 142460            2 2001 1220  58.0859392
48 142460            2 2001 1220  58.0859392
49 142460            2 2001 1220  58.0859392

The final data set should look like this:

    gvkey auditor_fkey year sic  percentage      specialized
1   10390            2 2000 0100 100.0000000           1
2   10390            2 2000 0100 100.0000000           1
3    3107            2 2000 1000  37.0939127           0
4    3107            2 2000 1000  37.0939127           0
5   14590            5 2000 1000  60.6862904           1
6   14590            5 2000 1000  60.6862904           1
7   25776            1 2000 1000   2.2197969           0
8   31626            2 2000 1000  37.0939127           0
9   61811            5 2000 1000  60.6862904           1
10   5560            1 2000 1040 100.0000000           1
11  20488            1 2000 1040 100.0000000           1
12  65921            1 2000 1040 100.0000000           1
13  65921            1 2000 1040 100.0000000           1
14  14793            2 2000 1220  34.7515456           0
15  14793            2 2000 1220  34.7515456           0
16  15274            1 2000 1220  65.2484544           1
17  15274            1 2000 1220  65.2484544           1
18   1266            4 2001 0100  26.9605909           0
19  10390            2 2001 0100  73.0394091           1
20  10390            2 2001 0100  73.0394091           1
21   3107            2 2001 1000   9.8899690           0
22   3107            2 2001 1000   9.8899690           0
23  14590            5 2001 1000  18.8287898           0
24  14590            5 2001 1000  18.8287898           0
25  19565            1 2001 1000  71.1457384           1
26  19565            1 2001 1000  71.1457384           1
27  25776            1 2001 1000  71.1457384           1
28  31626            2 2001 1000   9.8899690           0
29  61811            5 2001 1000  18.8287898           0
30  61811            5 2001 1000  18.8287898           0
31  64134          580 2001 1000   0.1355028           0
32  64134          580 2001 1000   0.1355028           0
33   5560            7 2001 1040   8.2959428           0
34   7881            5 2001 1040  71.1026743           1
35   9728          597 2001 1040   1.0906007           0
36   9728          597 2001 1040   1.0906007           0
37  12321            1 2001 1040  18.1873703           0
38  12321            1 2001 1040  18.1873703           0
39  20488            1 2001 1040  18.1873703           0
40  65921            1 2001 1040  18.1873703           0
41  65921            1 2001 1040  18.1873703           0
42 102341            2 2001 1040   1.3234119           0
43  14793            2 2001 1220  58.0859392           1
44  14793            2 2001 1220  58.0859392           1
45  15274            1 2001 1220  41.9140608           0
46  15274            1 2001 1220  41.9140608           0
47 142460            2 2001 1220  58.0859392           1
48 142460            2 2001 1220  58.0859392           1
49 142460            2 2001 1220  58.0859392           1

I appreciate your help.

The row order differs between your data and the expected result, so I took the data from the expected result instead. Here is a breakdown of the logic into separate columns before creating the dummy with dummy() from hablar.

library(hablar)
library(dplyr)

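df %>% 
  group_by(sic, year) %>% 
  mutate(
    # second-largest distinct percentage in the group (NA if there is only one)
    second_highest = nth(sort(unique(percentage), decreasing = TRUE), 2), 
    max_value = max(percentage),
    # does this row hold the group maximum?
    is_max   = percentage == max_value,
    # is the gap between the top two distinct values above 10?
    is_ab_10 = (max_value - second_highest) > 10,
    # TRUE -> 1, FALSE -> 0; an NA condition (no second value) is coded 1
    specialized = dummy(is_max & is_ab_10, missing = 1)
  ) %>% 
  ungroup() %>% 
  select(-c(second_highest, max_value, is_max, is_ab_10))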

Result

# A tibble: 49 x 6
   gvkey auditor_fkey  year   sic percentage specialized
   <int>        <int> <int> <int>      <dbl>       <int>
 1 10390            2  2000   100     100              1
 2 10390            2  2000   100     100              1
 3  3107            2  2000  1000      37.1            0
 4  3107            2  2000  1000      37.1            0
 5 14590            5  2000  1000      60.7            1
 6 14590            5  2000  1000      60.7            1
 7 25776            1  2000  1000       2.22           0
 8 31626            2  2000  1000      37.1            0
 9 61811            5  2000  1000      60.7            1
10  5560            1  2000  1040     100              1
# … with 39 more rows
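
For comparison, here is a minimal sketch of the same rule using dplyr alone, without hablar. It is only one possible way to write it. The helper second_largest() and the toy data are made up for illustration and are not part of the original post. As far as I can tell, the attempt in the question fails for two reasons: the closing parenthesis of ifelse() comes right after "> 10", so the 1 and 0 end up as extra arguments to mutate(), and "max(percentage) &&" treats a number as a condition instead of testing whether the row holds the maximum.

```r
library(dplyr)

# Toy data (hypothetical values) covering the three cases in the question:
# a group with one distinct value, a clear leader, and a close race.
toy <- data.frame(
  sic        = c(100, 100, 1000, 1000, 1000, 1220, 1220),
  year       = c(2000, 2000, 2000, 2000, 2000, 2000, 2000),
  percentage = c(100, 100, 37.1, 60.7, 2.2, 52, 48)
)

# Hypothetical helper: second-largest distinct value,
# or NA when the group has only one distinct value.
second_largest <- function(x) {
  u <- sort(unique(x), decreasing = TRUE)
  if (length(u) >= 2) u[2] else NA_real_
}

out <- toy %>%
  group_by(sic, year) %>%
  mutate(
    top    = max(percentage),
    second = second_largest(percentage),
    # A lone percentage counts as specialized; otherwise require a gap > 10.
    specialized = as.integer(
      percentage == top & (is.na(second) | top - second > 10)
    )
  ) %>%
  ungroup() %>%
  select(-top, -second)

print(out)
```

The is.na(second) test plays the role of missing = 1 in the hablar version: it decides explicitly what happens when a group has no second-highest percentage.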

Data

df <- structure(list(
  gvkey = c(10390L, 10390L, 3107L, 3107L, 14590L, 14590L, 25776L, 31626L, 
            61811L, 5560L, 20488L, 65921L, 65921L, 14793L, 14793L, 15274L, 
            15274L, 1266L, 10390L, 10390L, 3107L, 3107L, 14590L, 14590L, 
            19565L, 19565L, 25776L, 31626L, 61811L, 61811L, 64134L, 64134L, 
            5560L, 7881L, 9728L, 9728L, 12321L, 12321L, 20488L, 65921L, 
            65921L, 102341L, 14793L, 14793L, 15274L, 15274L, 142460L, 
            142460L, 142460L), 
  auditor_fkey = c(2L, 2L, 2L, 2L, 5L, 5L, 1L, 2L, 5L, 1L, 1L, 1L, 1L, 2L, 
                   2L, 1L, 1L, 4L, 2L, 2L, 2L, 2L, 5L, 5L, 1L, 1L, 1L, 2L, 
                   5L, 5L, 580L, 580L, 7L, 5L, 597L, 597L, 1L, 1L, 1L, 1L, 
                   1L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L), 
  year = c(2000L, 2000L, 2000L, 2000L, 2000L, 2000L, 2000L, 2000L, 2000L, 
           2000L, 2000L, 2000L, 2000L, 2000L, 2000L, 2000L, 2000L, 2001L, 
           2001L, 2001L, 2001L, 2001L, 2001L, 2001L, 2001L, 2001L, 2001L, 
           2001L, 2001L, 2001L, 2001L, 2001L, 2001L, 2001L, 2001L, 2001L, 
           2001L, 2001L, 2001L, 2001L, 2001L, 2001L, 2001L, 2001L, 2001L, 
           2001L, 2001L, 2001L, 2001L), 
  sic = c(100L, 100L, 1000L, 1000L, 1000L, 1000L, 1000L, 1000L, 1000L, 
          1040L, 1040L, 1040L, 1040L, 1220L, 1220L, 1220L, 1220L, 100L, 
          100L, 100L, 1000L, 1000L, 1000L, 1000L, 1000L, 1000L, 1000L, 
          1000L, 1000L, 1000L, 1000L, 1000L, 1040L, 1040L, 1040L, 1040L, 
          1040L, 1040L, 1040L, 1040L, 1040L, 1040L, 1220L, 1220L, 1220L, 
          1220L, 1220L, 1220L, 1220L), 
  percentage = c(100, 100, 37.0939127, 37.0939127, 60.6862904, 60.6862904, 
                 2.2197969, 37.0939127, 60.6862904, 100, 100, 100, 100, 
                 34.7515456, 34.7515456, 65.2484544, 65.2484544, 26.9605909, 
                 73.0394091, 73.0394091, 9.889969, 9.889969, 18.8287898, 
                 18.8287898, 71.1457384, 71.1457384, 71.1457384, 9.889969, 
                 18.8287898, 18.8287898, 0.1355028, 0.1355028, 8.2959428, 
                 71.1026743, 1.0906007, 1.0906007, 18.1873703, 18.1873703, 
                 18.1873703, 18.1873703, 18.1873703, 1.3234119, 58.0859392, 
                 58.0859392, 41.9140608, 41.9140608, 58.0859392, 58.0859392, 
                 58.0859392)), 
  row.names = c(NA, -49L), 
  class = c("tbl_df", "tbl", "data.frame"))
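
One thing to be aware of: in the data above, sic is stored as an integer, so the code written as 0100 in the question shows up as 100. Grouping is not affected, but if the four-digit form matters for display or joins, a small sketch like the following could restore it. It assumes that every sic code is meant to be four digits wide, which the post does not state explicitly.

```r
# sic values as they appear in the dput() output (integers, leading zero lost)
sic <- c(100L, 1000L, 1040L, 1220L)

# Pad to four characters with leading zeros (assumes four-digit codes)
sic_chr <- sprintf("%04d", sic)

print(sic_chr)
```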
