
R data.table: Subsetting and assignment by reference in a for loop

Seems like an easy one, but... well...

Given a named vector of regular expressions and a data table as follows:

library(data.table)
regexes <- c(a="^A$") 
dt <- fread("
a,A,1
a,B,1
b,A,1
")

The input data table is

dt
#    V1 V2 V3
# 1:  a  A  1
# 2:  a  B  1
# 3:  b  A  1

My goal for the 1st element in regexes would be:

If V1=="a", set V3:=2, EXCEPT when V2 matches the corresponding regular expression ^A$; in that case set V3:=3.

(a is names(regexes)[1], ^A$ is regexes[1]; 2 and 3 are just for demo purposes. I also have more names and regular expressions to loop over, and the data set has about 300,000 rows.)

So the expected output is

#    V1 V2 V3
# 1:  a  A  3 (*)
# 2:  a  B  2 (**)
# 3:  b  A  1

(*) 3 because V1 is a and V2 (A) matches the regex,
(**) 2 because V1 is a and V2 (B) does not match ^A$.

I tried to loop through the regexes and pipe the subsetting through like this:

for (x in seq(regexes)) 
  dt[V1==names(regexes)[x], V3:=2][grepl(regexes[x], V2), V3:=3]

However...

dt
#    V1 V2 V3
# 1:  a  A  3 
# 2:  a  B  2
# 3:  b  A  3 <- wrong, should remain 1

... it does not work as expected: grepl uses the complete V2 column, not just the V1=="a" subset. I also tried some other things, which worked but took too long (i.e. not the way to use data.table).

Question: What would be the best data.table way to go here? packageVersion("data.table") gives '1.9.7'.


Note that I could go the data.frame route, e.g. like this:

df <- as.data.frame(dt)
for (x in seq(regexes)) {
  idx <- df$V1==names(regexes)[x]
  df$V3[idx] <- 2
  df$V3[idx][grepl(regexes[x], df$V2[idx])] <- 3 # or ifelse()
}  

But - of course - I would not want to convert the data.table to a data.frame and then back to a data.table if possible.

Thanks in advance!

> ... it does not work as expected, grepl uses the complete V2 column, not just the V1=="a" subset.
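That happens because dt[i, V3 := 2] returns the whole table (invisibly), not the subset selected by i, so the chained [grepl(...), V3 := 3] is evaluated against every row. If you want to keep the loop, a minimal sketch is to repeat the V1 condition in the second subset. The toy table below is rebuilt with data.table() instead of fread() only to keep the snippet self-contained, and the loop variable nm is my own naming, not something from the question:

```r
library(data.table)

regexes <- c(a = "^A$")
dt <- data.table(V1 = c("a", "a", "b"),
                 V2 = c("A", "B", "A"),
                 V3 = c(1L, 1L, 1L))

for (nm in names(regexes)) {
  # Rows whose V1 equals the current name get the default value 2 ...
  dt[V1 == nm, V3 := 2L]
  # ... and only those same rows are tested against the regex.
  dt[V1 == nm & grepl(regexes[[nm]], V2), V3 := 3L]
}

dt$V3  # 3 2 1
```

This still calls grepl on the full V2 column once per regex, so it is probably not the fastest option when there are many names to loop over.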

I would use stringi, which allows for easy vectorization of regex tests:

library(stringi)
dt[V1 %in% names(regexes), 
  V3 := V3 + 1L + stri_detect(V2, regex = regexes[V1])
]

#    V1 V2 V3
# 1:  a  A  3
# 2:  a  B  2
# 3:  b  A  1

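The stri_detect family of functions is like grepl from base, except that it is also vectorized over the pattern, so each row can be tested against its own regex.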
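One caveat: V3 + 1L + ... gives 2 and 3 only because V3 starts at 1 in the matching rows. If the values should be set to 2 and 3 regardless of what V3 held before, a variant along these lines should do it (a sketch on the same toy data, built with data.table() rather than fread() so it runs on its own):

```r
library(data.table)
library(stringi)

regexes <- c(a = "^A$")
dt <- data.table(V1 = c("a", "a", "b"),
                 V2 = c("A", "B", "A"),
                 V3 = c(1L, 1L, 1L))

# Inside j, V1 and V2 are already restricted to the rows selected by i.
# regexes[V1] looks up one pattern per row by name; adding the logical
# result to 2L yields 2 (no match) or 3 (match).
dt[V1 %in% names(regexes),
   V3 := 2L + stri_detect(V2, regex = regexes[V1])]

dt$V3  # 3 2 1
```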

