I am trying to create a new variable in a data.frame based on something like the following data:
df <- structure(list(id = c(123L, 123L, 332L, 332L, 332L, 100L, 100L,
113L, 113L, 113L, 113L, 551L, 551L), icpc = c("D95", "F85", "A01",
"A04", "K20", "B10", "A04", "T08", "P28", "D95", "A04", "B12",
"D95"), icpc2 = c("F15", "", "", "", "", "", "", "", "", "A01",
"", "A01", ""), reg.date = c("19JUN2015", "15AUG2016", "16MAR2010",
"20JAN2018", "20FEB2017", "01JUN2017", "11JAN2008", "18MAR2018",
"19JAN2017", "16JAN2013", "01MAY2009", "03APR2011", "09MAY2015"
)), class = "data.frame", row.names = c(NA, -13L))
I've managed to create the new column condit with the following code:
library(data.table)
cond1 <- c("D95", "A01")
setDT(df)[, condit := ifelse(any(icpc %in% cond1 | icpc2 %in% cond1), "yes","no"), by=id]
df
However, I am working with a big dataset (>40 million) and also want to categorize based on the letter in icpc and icpc2.
My goal is to add a new column which gives yes or no depending on whether there is a letter A (so A01, A04, A50, etc.) in either column icpc or icpc2. I also want all rows with the same id to have yes in the new column condit2 if any row for that id matches.
I was trying the following:
df2 <- setDT(df)[, condit2 := ifelse
(any(icpc %in% pmatch("K", df) | icpc2 %in% pmatch("K", df)), "yes","no"), by = PATNR]
head(df2)
This kept on running forever... (I guess df is too broad anyway; it should have been df$icpc and df$icpc2?)
Then I tried the following to check whether pmatch is suitable:
condit2 <- pmatch("K")
And then I looked at something totally different:
library(sqldf)
condit2 <- sqldf("df$icpc | df$icpc2, '%K%'")
The result I am after is the following data frame:
id icpc icpc2 reg.date condit2
1: 123 D95 F15 19JUN2015 no
2: 123 F85 15AUG2016 no
3: 332 A01 16MAR2010 yes
4: 332 A04 20JAN2018 yes
5: 332 K20 20FEB2017 yes
6: 100 B10 01JUN2017 yes
7: 100 A04 11JAN2008 yes
8: 113 T08 18MAR2018 yes
9: 113 P28 19JAN2017 yes
10: 113 D95 A01 16JAN2013 yes
11: 113 A04 01MAY2009 yes
12: 551 B12 A01 03APR2011 yes
13: 551 D95 09MAY2015 yes
Can anyone give a hint? Thanks!!
setDT(df)
to_check <- 'A'
df[, condit2 := fifelse(any(grepl(to_check, icpc) | grepl(to_check, icpc2)),
'yes', 'no'),
by = id]
df
# id icpc icpc2 reg.date condit2
# 1: 123 D95 F15 19JUN2015 no
# 2: 123 F85 15AUG2016 no
# 3: 332 A01 16MAR2010 yes
# 4: 332 A04 20JAN2018 yes
# 5: 332 K20 20FEB2017 yes
# 6: 100 B10 01JUN2017 yes
# 7: 100 A04 11JAN2008 yes
# 8: 113 T08 18MAR2018 yes
# 9: 113 P28 19JAN2017 yes
# 10: 113 D95 A01 16JAN2013 yes
# 11: 113 A04 01MAY2009 yes
# 12: 551 B12 A01 03APR2011 yes
# 13: 551 D95 09MAY2015 yes
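Note that grepl(to_check, ...) matches the letter anywhere in the code. That is fine as long as the letter can only be the first character, as in the sample data, but you may prefer to make the prefix test explicit. A minimal sketch, assuming every code starts with its category letter; the toy table dt is made up for illustration and is not the asker's data:

```r
library(data.table)

# Toy data, made up for illustration
dt <- data.table(
  id    = c(1L, 1L, 2L, 2L),
  icpc  = c("A01", "K20", "D95", "F85"),
  icpc2 = c("", "", "F15", "")
)

to_check <- "A"

# startsWith() is a fixed-string prefix test (no regex), so an "A"
# appearing later in a code would not count as a match
dt[, condit2 := fifelse(any(startsWith(icpc, to_check) | startsWith(icpc2, to_check)),
                        "yes", "no"),
   by = id]
dt
```

Empty strings simply give FALSE here, so the blank icpc2 entries need no special handling.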
If, instead of just two columns icpc and icpc2, you have a bunch of them and don't want to type out the grepl code for every one, here's a version with .SDcols which gives the same result.
df[, condit2 := fifelse(any(Reduce('|', lapply(.SD, grepl, pattern = to_check))),
'yes', 'no'),
by = id, .SDcols = patterns('icpc')]
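With tens of millions of rows and many distinct ids, evaluating an expression once per group can dominate the run time. One possible alternative is to compute the row-level match once over the whole table, collect the ids that have at least one hit, and then flag rows by membership. This is only a sketch on made-up toy data (dt, code_cols, row_hit and hit_ids are illustrative names), and whether it is actually faster is something to benchmark on your own data:

```r
library(data.table)

# Toy data, made up for illustration
dt <- data.table(
  id    = c(1L, 1L, 2L, 2L, 3L),
  icpc  = c("A01", "K20", "D95", "F85", "B12"),
  icpc2 = c("", "", "F15", "", "A04")
)

to_check  <- "A"
code_cols <- grep("icpc", names(dt), value = TRUE)

# One vectorised pass per column, combined row-wise with |
row_hit <- Reduce(`|`, lapply(code_cols, function(col) grepl(to_check, dt[[col]])))

# ids that have at least one matching row
hit_ids <- unique(dt$id[row_hit])

# Flag every row whose id is among the matching ids (no by = needed)
dt[, condit2 := fifelse(id %in% hit_ids, "yes", "no")]
dt
```

The grouping step is replaced by a single %in% lookup, so the result is the same "any row in the group matches" logic as above.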
With dplyr this can be done as follows: group_by(id), paste the two columns of interest together, and check whether at least one A occurs in the concatenated strings using sum and grepl.
library(dplyr)
df %>%
group_by(id) %>%
mutate(condit2 = case_when(sum(grepl("A", paste(icpc, icpc2))) > 0 ~ "yes",
TRUE ~ "no")) %>%
ungroup()
id icpc icpc2 reg.date condit2
<int> <chr> <chr> <chr> <chr>
1 123 D95 "F15" 19JUN2015 no
2 123 F85 "" 15AUG2016 no
3 332 A01 "" 16MAR2010 yes
4 332 A04 "" 20JAN2018 yes
5 332 K20 "" 20FEB2017 yes
6 100 B10 "" 01JUN2017 yes
7 100 A04 "" 11JAN2008 yes
8 113 T08 "" 18MAR2018 yes
9 113 P28 "" 19JAN2017 yes
10 113 D95 "A01" 16JAN2013 yes
11 113 A04 "" 01MAY2009 yes
12 551 B12 "A01" 03APR2011 yes
13 551 D95 "" 09MAY2015 yes
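If neither package is available, the same "any match within the id" logic can also be sketched in base R with ave(). This is an illustrative variant on made-up toy data (df_toy and hit are hypothetical names), not something from the answers above, and it is unlikely to be the fastest option on 40 million rows:

```r
# Toy data, made up for illustration
df_toy <- data.frame(
  id    = c(1L, 1L, 2L, 2L, 3L),
  icpc  = c("A01", "K20", "D95", "F85", "B12"),
  icpc2 = c("", "", "F15", "", "A04"),
  stringsAsFactors = FALSE
)

# Row-level flag: does either column contain an "A"?
hit <- grepl("A", df_toy$icpc) | grepl("A", df_toy$icpc2)

# ave() spreads the per-id maximum (1 if any row matched) back over all rows
any_hit <- ave(as.integer(hit), df_toy$id, FUN = max) > 0

df_toy$condit2 <- ifelse(any_hit, "yes", "no")
df_toy
```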