Seems like an easy one, but... well...
Given a named vector of regular expressions and a data table as follows:
library(data.table)
regexes <- c(a="^A$")
dt <- fread("
a,A,1
a,B,1
b,A,1
")
The input data table is
dt
# V1 V2 V3
# 1: a A 1
# 2: a B 1
# 3: b A 1
My goal for the 1st element in regexes would be: if V1=="a", set V3:=2, EXCEPT when V2 matches the corresponding regular expression ^A$, in which case set V3:=3.
(a is names(regexes)[1], ^A$ is regexes[1], and 2 and 3 are just for demo purposes. I also have more names and regular expressions to loop over, and the data set is about 300,000 rows.)
So the expected output is
# V1 V2 V3
# 1: a A 3 (*)
# 2: a B 2 (**)
# 3: b A 1
(*) 3 because V1 is a and V2 (A) matches the regex,
(**) 2 because V1 is a and V2 (B) does not match ^A$.
I tried to loop through the regexes and pipe the subsetting through like this:
for (x in seq(regexes))
dt[V1==names(regexes)[x], V3:=2][grepl(regexes[x], V2), V3:=3]
However...
dt
# V1 V2 V3
# 1: a A 3
# 2: a B 2
# 3: b A 3 <- wrong, should remain 1
... it does not work as expected: grepl uses the complete V2 column, not just the V1=="a" subset. I also tried some other things, which worked but took too long (i.e. not the way to use data.table).
Question: What would be the best data.table way to go here? I'm using data.table version 1.9.7 (as reported by packageVersion("data.table")).
Note that I could go the data frame route, e.g. like this:
df <- as.data.frame(dt)
for (x in seq(regexes)) {
idx <- df$V1==names(regexes)[x]
df$V3[idx] <- 2
df$V3[idx][grepl(regexes[x], df$V2[idx])] <- 3 # or ifelse()
}
But - of course - I would not want to convert the data.table to a data.frame and then back to a data.table if possible.
Thanks in advance!
"... it does not work as expected, grepl uses the complete V2 column, not just the V1=="a" subset."
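The reason is that each [...] in a chain is evaluated against the whole table returned by the previous call, so the second filter never sees the V1 restriction. A minimal sketch of a loop that keeps the restriction by repeating it in the second condition, assuming the same dt and regexes as in the question (the loop variable nm is only an illustrative name, and dt is rebuilt with data.table() instead of fread() so the block is self-contained):

```r
library(data.table)

regexes <- c(a = "^A$")
dt <- data.table(V1 = c("a", "a", "b"),
                 V2 = c("A", "B", "A"),
                 V3 = c(1L, 1L, 1L))

for (nm in names(regexes)) {
  # Rows whose V1 equals the regex name get the default value.
  dt[V1 == nm, V3 := 2L]
  # Repeat the V1 condition so grepl's result only affects that subset.
  dt[V1 == nm & grepl(regexes[[nm]], V2), V3 := 3L]
}

dt$V3
# 3 2 1
```

This still makes one pass over the table per regex, so it may be slow with many names.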
I would use stringi, which allows for easy vectorization of regex tests:
library(stringi)
dt[V1 %in% names(regexes),
V3 := V3 + 1L + stri_detect(V2, regex = regexes[V1])
]
V1 V2 V3
1: a A 3
2: a B 2
3: b A 1
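The stri_detect family of functions is like grepl from base, but it is vectorized over the pattern as well as the string, so regexes[V1] supplies the matching regex for each row.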
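To make the row-by-row pairing concrete, here is a small sketch outside of data.table. The second entry of regexes (b = "^B") is a made-up pattern added only for illustration, and patterns and hits are hypothetical variable names:

```r
library(stringi)

# Hypothetical extension of the question's vector: "b" is not in the original.
regexes <- c(a = "^A$", b = "^B")
V1 <- c("a", "a", "b")
V2 <- c("A", "B", "A")

# Named-vector lookup: one pattern per element of V1.
patterns <- regexes[V1]

# Element i of V2 is tested against element i of patterns.
hits <- stri_detect(V2, regex = patterns)
```

Here the first element is tested against ^A$, the second against ^A$, and the third against ^B, which is what lets the answer above avoid an explicit loop over the regexes.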