Create new variable based on partial matching in column R

Question

I am trying to create a new variable in a data.frame based on something like the following data:

df <- structure(list(id = c(123L, 123L, 332L, 332L, 332L, 100L, 100L, 
113L, 113L, 113L, 113L, 551L, 551L), icpc = c("D95", "F85", "A01", 
"A04", "K20", "B10", "A04", "T08", "P28", "D95", "A04", "B12", 
"D95"), icpc2 = c("F15", "", "", "", "", "", "", "", "", "A01", 
"", "A01", ""), reg.date = c("19JUN2015", "15AUG2016", "16MAR2010", 
"20JAN2018", "20FEB2017", "01JUN2017", "11JAN2008", "18MAR2018", 
"19JAN2017", "16JAN2013", "01MAY2009", "03APR2011", "09MAY2015"
)), class = "data.frame", row.names = c(NA, -13L))

I've managed this with the following code for the new column condit:

library(data.table)

cond1 <- c("D95", "A01")
setDT(df)[, condit := ifelse(any(icpc %in% cond1 | icpc2 %in% cond1), "yes","no"), by=id]
df

However, I am working with a big dataset (>40 million) and also want to categorize based on the letter in icpc and icpc2.

My goal is to add a new column that says yes or no depending on whether there is a code starting with the letter A (so A01, A04, A50, etc.) in either column icpc or icpc2. I also want all rows with the same id to have yes in the new column condit2.

I was trying the following:

df2 <- setDT(df)[, condit2 := ifelse
                            (any(icpc %in% pmatch("K", df) | icpc2 %in% pmatch("K", df)), "yes","no"), by = PATNR]
head(df2)

This kept on running forever... (I guess df is too broad anyway; it should have been df$icpc and df$icpc2?)

Then I tried the following to check whether pmatch is suitable:

condit2 <- pmatch("K")

And then I looked at something totally different:

library(sqldf)
condit2 <- sqldf("df$icpc | df$icpc2, '%K%'")

The result should be the following data frame:

    id  icpc icpc2 reg.date    condit2
 1: 123  D95   F15 19JUN2015    no
 2: 123  F85       15AUG2016    no
 3: 332  A01       16MAR2010    yes
 4: 332  A04       20JAN2018    yes
 5: 332  K20       20FEB2017    yes
 6: 100  B10       01JUN2017    yes
 7: 100  A04       11JAN2008    yes
 8: 113  T08       18MAR2018    yes
 9: 113  P28       19JAN2017    yes
10: 113  D95   A01 16JAN2013    yes
11: 113  A04       01MAY2009    yes
12: 551  B12   A01 03APR2011    yes
13: 551  D95       09MAY2015    yes

Can anyone give a hint? Thanks!!

Answer 1

setDT(df)

to_check <- 'A'

df[, condit2 := fifelse(any(grepl(to_check, icpc) | grepl(to_check, icpc2)),
                        'yes', 'no'), 
   by = id]

df
#      id icpc icpc2  reg.date condit2
#  1: 123  D95   F15 19JUN2015      no
#  2: 123  F85       15AUG2016      no
#  3: 332  A01       16MAR2010     yes
#  4: 332  A04       20JAN2018     yes
#  5: 332  K20       20FEB2017     yes
#  6: 100  B10       01JUN2017     yes
#  7: 100  A04       11JAN2008     yes
#  8: 113  T08       18MAR2018     yes
#  9: 113  P28       19JAN2017     yes
# 10: 113  D95   A01 16JAN2013     yes
# 11: 113  A04       01MAY2009     yes
# 12: 551  B12   A01 03APR2011     yes
# 13: 551  D95       09MAY2015     yes
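
Since the question mentions more than 40 million rows, here is a hedged sketch of a variant that may be worth benchmarking against the grouped version above. It assumes the letter of interest is always the first character of the code (as in A01, A04), so it uses base R's startsWith() instead of a regular expression, and it replaces the by = id step with a plain lookup of the ids that have at least one hit. The names dt, letter, hit and ids_with_hit are made up for this illustration, and dt is a small stand-in for the real data.

```r
library(data.table)

# Small stand-in for the question's data (same column names).
dt <- data.table(
  id    = c(123L, 123L, 332L, 332L, 100L, 551L, 551L),
  icpc  = c("D95", "F85", "A01", "K20", "B10", "B12", "D95"),
  icpc2 = c("F15", "",    "",    "",    "",    "A01", "")
)

letter <- "A"  # hypothetical name for the chapter letter to look for

# Row-level flag, computed once over the whole table (no grouping, no regex).
# Empty strings give FALSE; NA values would give NA and need separate handling.
hit <- startsWith(dt$icpc, letter) | startsWith(dt$icpc2, letter)

# ids with at least one flagged row, then a plain membership lookup.
ids_with_hit <- unique(dt$id[hit])
dt[, condit2 := fifelse(id %in% ids_with_hit, "yes", "no")]

dt
```

Unlike grepl('A', ...), startsWith() only matches the letter at the start of the code, which is what the A01 / A04 / A50 examples in the question suggest. If the letter can appear elsewhere in the code, the grepl version above is the one to keep.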

If, instead of just the two columns icpc and icpc2, you have a bunch of them and don't want to type out the grepl code for every one, here's a version with .SDcols which gives the same result.

df[, condit2 := fifelse(any(Reduce('|', lapply(.SD, grepl, pattern = to_check))),
                        'yes', 'no'), 
   by = id, .SDcols = patterns('icpc')]
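
To make the Reduce('|', lapply(...)) step less opaque, here is a small base R sketch of what it does to the columns selected by .SDcols. The list cols is a toy stand-in for .SD (two made-up columns with three rows each), and per_column and any_column are names chosen for this illustration.

```r
# Two toy columns standing in for icpc and icpc2.
cols <- list(
  icpc  = c("D95", "A04", "B12"),
  icpc2 = c("F15", "",    "A01")
)

# One logical vector per column: does the value contain "A"?
per_column <- lapply(cols, grepl, pattern = "A")

# Element-wise OR across all of those vectors, however many columns there are.
any_column <- Reduce(`|`, per_column)

any_column
# [1] FALSE  TRUE  TRUE
```

Wrapping that result in any() within each id group then gives a single TRUE or FALSE per id, which fifelse turns into 'yes' or 'no'.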

Answer 2

With dplyr this can be done as follows: group_by(id), paste the two columns of interest together, and check whether at least one A occurs in the concatenated strings using sum and grepl.

library(dplyr)
df %>% 
  group_by(id) %>% 
  mutate(condit2 = case_when(sum(grepl("A", paste(icpc, icpc2))) > 0 ~ "yes",
                             TRUE ~ "no")) %>% 
  ungroup()


      id icpc  icpc2 reg.date  condit2
   <int> <chr> <chr> <chr>     <chr>  
 1   123 D95   "F15" 19JUN2015 no     
 2   123 F85   ""    15AUG2016 no     
 3   332 A01   ""    16MAR2010 yes    
 4   332 A04   ""    20JAN2018 yes    
 5   332 K20   ""    20FEB2017 yes    
 6   100 B10   ""    01JUN2017 yes    
 7   100 A04   ""    11JAN2008 yes    
 8   113 T08   ""    18MAR2018 yes    
 9   113 P28   ""    19JAN2017 yes    
10   113 D95   "A01" 16JAN2013 yes    
11   113 A04   ""    01MAY2009 yes    
12   551 B12   "A01" 03APR2011 yes    
13   551 D95   ""    09MAY2015 yes
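
A possible variation on the same idea, sketched here on a small made-up tibble (toy and res are illustrative names): check each column separately with grepl and combine the results with any(), which avoids building the pasted strings. This assumes only the two columns icpc and icpc2 need checking.

```r
library(dplyr)

# Small stand-in for the question's data (same column names).
toy <- tibble(
  id    = c(123L, 123L, 332L, 332L, 551L, 551L),
  icpc  = c("D95", "F85", "A01", "K20", "B12", "D95"),
  icpc2 = c("F15", "",    "",    "",    "A01", "")
)

res <- toy %>%
  group_by(id) %>%
  # any() collapses the row-wise matches to one TRUE/FALSE per id;
  # mutate() recycles the resulting single value to every row of the group.
  mutate(condit2 = if_else(any(grepl("A", icpc) | grepl("A", icpc2)),
                           "yes", "no")) %>%
  ungroup()

res
```

Here id 123 ends up with no on both rows, while ids 332 and 551 get yes on all of their rows.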

2 answers

solution1
3 ACCEPTED 2020-04-01 18:00:33

solution2
2 2020-04-01 17:57:29
