
Create a categorical variable based on two columns in data.table R

I have the following data.table:

library(data.table)

df <- data.table(id=c(1,2,3,4,5,6,7,8,9,10),
                 var1=c(0,4,5,6,99,3,5,5,23,0),
                 var2=c(22,4,6,25,6,70,75,23,24,21))
    id var1 var2
 1:  1    0   22
 2:  2    4    4
 3:  3    5    6
 4:  4    6   25
 5:  5   99    6
 6:  6    3   70
 7:  7    5   75
 8:  8    5   23
 9:  9   23   24
10: 10    0   21

I want to create a binary variable that is 'yes' for any value other than 0 or 99 in var1 and/or any value between 20 and 29 in var2, and 'no' otherwise. The expected result is the following:

    id var1 var2 cat
 1:  1    0   22 yes
 2:  2    4    4 yes
 3:  3    5    6 yes
 4:  4    6   25 yes
 5:  5   99    6  no
 6:  6    3   70 yes
 7:  7    5   75 yes
 8:  8    0   23 yes
 9:  9   99   24 yes
10: 10    0    0  no

The original data.table is much larger, with thousands of rows. The target values for 'yes' in var2 are multiple arbitrary values that are not connected with each other, so I will likely have to type them manually with c(). I would appreciate a data.table solution. So far I have tried using %in%, but I don't know how to apply it to two columns; I have only used it on a single column before. Thanks!

You can just use fifelse(), data.table's fast ifelse. I split the logic into separate steps to make it easier to read. You have to use some boolean logic to get what you want.

Take your first condition (var1 not in 0 or 99) and combine it with the var2 condition using the | (or) operator, which gives TRUE when either condition holds. If any 0 or 99 in var1 should produce FALSE regardless of var2, you then have to & that result with the var1 condition. This is condition2 below.

It's not clear which of the two you want. The second condition appears to be what you are after, but because your expected results don't match your input data, I cannot be sure. You also said and/or, which doesn't really make sense in a boolean context (it's one or the other).
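A small truth table may help to see how the two operators combine. This is only a sketch in base R; var1_ok and var2_ok are hypothetical stand-ins for "var1 is not 0 or 99" and "var2 is between 20 and 29".

```r
# Every combination of the two tests (hypothetical column names).
tt <- expand.grid(var1_ok = c(TRUE, FALSE), var2_ok = c(TRUE, FALSE))

# "or" of the two tests: TRUE when at least one test passes.
tt$condition <- tt$var1_ok | tt$var2_ok

# "and" the result with the var1 test: a failed var1 test always gives FALSE.
tt$condition2 <- tt$condition & tt$var1_ok

print(tt)

# Check whether condition2 differs from the var1 test in any row.
all(tt$condition2 == tt$var1_ok)
```

In this table condition2 comes out equal to var1_ok in every row, so under the second reading the var2 test never changes the result.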

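    # TRUE where var1 is neither 0 nor 99
    not_zero_nn <- !(df$var1 %in% c(0, 99))
    # first reading: 'yes' if either test passes
    condition <- not_zero_nn | (df$var2 %in% 20:29)
    # second reading: a 0 or 99 in var1 always gives 'no'
    condition2 <- condition & not_zero_nn
    
    df[, cat := fifelse(condition, 'yes', 'no')]
    #    id var1 var2 cat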
    # 1:  1    0   22 yes
    # 2:  2    4    4 yes
    # 3:  3    5    6 yes
    # 4:  4    6   25 yes
    # 5:  5   99    6  no
    # 6:  6    3   70 yes
    # 7:  7    5   75 yes
    # 8:  8    5   23 yes
    # 9:  9   23   24 yes
    # 10: 10    0   21 yes
    
    df[, cat := fifelse(condition2, 'yes', 'no')]
    #    id var1 var2 cat
    # 1:  1    0   22  no
    # 2:  2    4    4 yes
    # 3:  3    5    6 yes
    # 4:  4    6   25 yes
    # 5:  5   99    6  no
    # 6:  6    3   70 yes
    # 7:  7    5   75 yes
    # 8:  8    5   23 yes
    # 9:  9   23   24 yes
    # 10: 10    0   21  no
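Since the question mentions that the real target values for var2 are unconnected and would be typed by hand with c(), here is a minimal sketch of the same idea evaluated directly inside df[...]. The names var1_excluded and var2_targets, and the values in var2_targets, are made up for illustration. It implements the plain "or" reading (the first condition), so adjust the logic if the other reading is what you need.

```r
library(data.table)

df <- data.table(id   = 1:10,
                 var1 = c(0, 4, 5, 6, 99, 3, 5, 5, 23, 0),
                 var2 = c(22, 4, 6, 25, 6, 70, 75, 23, 24, 21))

# Hypothetical lookup vectors: replace with the real values.
var1_excluded <- c(0, 99)
var2_targets  <- c(21, 23, 25, 70)  # arbitrary, unconnected values

# Columns can be referenced by name inside [ ], so no df$ prefix is needed.
# %in% is applied to each column separately and the results are combined with |.
df[, cat := fifelse(!(var1 %in% var1_excluded) | var2 %in% var2_targets,
                    "yes", "no")]

print(df)
```

Keeping the target values in named vectors means the condition itself does not have to change when the list of values grows.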
