Use ifelse statement in R data.table column calculation dependent on value of first row
I have to run some regexes on large data.tables (30M+ rows each, and there are many of them). One column is either a single repeated string (the same for every row) or missing, and the other column holds a different string per row. If the first column's value is missing or passes some other regex, I do not want to run the regex at all and just return FALSE; if it is not missing, I want to check whether the columns match. I need this for thousands of data.tables, and the regex takes a couple of seconds each time, so I would like to include an ifelse statement where the regex is not even attempted if the condition is FALSE.
This is what I attempted, but none of these work (I also tried fifelse and if_else):
library(data.table)
set.seed(10)
data_table_test <-
  data.table(col = rep("c", 1e6),
             col2 = paste(
               sample(letters, 1e6, replace = TRUE),
               sample(letters, 1e6, replace = TRUE),
               sep = ""
             ))
data_table_test2 <-
  data.table(col = rep(NA, 1e6),
             col2 = paste(
               sample(letters, 1e6, replace = TRUE),
               sample(letters, 1e6, replace = TRUE),
               sep = ""
             ))
data_table_test[, ':='(matching_letter_1 = stringi::stri_detect_fixed(col2, col),
matching_letter_2 = ifelse(is.na(data_table_test[1, col ]), F, stringi::stri_detect_fixed(col2, col))),]
data_table_test2[, ':='(matching_letter_1 = stringi::stri_detect_fixed(col2, col),
matching_letter_2 = ifelse(is.na(data_table_test2[1, col ]), F, stringi::stri_detect_fixed(col2, col))),]
This does work, but is slower:
data_table_test2[, ':='(matching_letter_1 = stringi::stri_detect_fixed(col2, col)), ][, ':='(matching_letter_1 = fifelse(is.na(matching_letter_1), F, matching_letter_1)), ]
EDIT: The expected output should be something like this:
data_table_test[matching_letter_1 == TRUE]
should be the same as
data_table_test[matching_letter_2 == TRUE]
and
data_table_test2[matching_letter_1 == TRUE]
should be the same as (both are empty data.tables)
data_table_test2[matching_letter_2 == TRUE]
A slow but functional tidyverse solution would be this:
data_table_test %>%
as_tibble() %>%
rowwise() %>%
mutate(matching_letter = ifelse(is.na(data_table_test$col[1]), F, stringi::stri_detect_fixed(col2, col))) %>%
filter(matching_letter)
# A tibble: 75,772 x 3
# Rowwise:
col col2 matching_letter
<chr> <chr> <lgl>
1 c cb TRUE
2 c ce TRUE
3 c yc TRUE
4 c ch TRUE
5 c ic TRUE
6 c gc TRUE
7 c cg TRUE
8 c lc TRUE
9 c ci TRUE
10 c zc TRUE
# ... with 75,762 more rows
and
data_table_test2 %>%
as_tibble() %>%
rowwise() %>%
mutate(matching_letter = ifelse(is.na(data_table_test2$col[1]), F, stringi::stri_detect_fixed(col2, col))) %>%
filter(matching_letter)
# A tibble: 0 x 3
# Rowwise:
# ... with 3 variables: col <lgl>, col2 <chr>, matching_letter <lgl>
EDIT 2: This code would do the trick, but it is not the solution I need, because I need to test many combinations of columns. I need the if statement inside the data.table operation.
if (is.na(data_table_test[1, col])) {
  data_table_test[, matching_letter := FALSE]
} else {
  data_table_test[, matching_letter := stringi::stri_detect_fixed(col2, col)]
}
I do not have tidyverse to compare the expected output against; please include expected output produced without such heavy dependencies.
setmatchingletter = function(x) {
  stopifnot(nrow(x)>0L, c("col","col2")%in%names(x))
  v = if (is.na(x$col[1L])) FALSE else {
    stringi::stri_detect_fixed(x$col2, x$col)
  }
  set(x, , "matching_letter", v)
}
setmatchingletter(data_table_test)
data_table_test[matching_letter==TRUE]
setmatchingletter(data_table_test2)
data_table_test2[matching_letter==TRUE]
This solution assumes that stringi::stri_detect_fixed is "vectorized", unlike the use of it in the question.
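The same short-circuit can also be written directly inside the data.table call, which may be closer to what the question asks for. The sketch below uses small stand-in tables rather than the 1e6-row ones, and the helper set_match_cols and its pairs argument are hypothetical names introduced only for illustration. It relies on two things: base ifelse() returns a result shaped like its test, so a scalar test yields a single element that := then recycles over the column, whereas a plain if/else evaluates only the branch that is taken and returns it whole.

```r
library(data.table)

# Small stand-ins for the tables in the question (illustrative data only).
dt_ok <- data.table(col = rep("c", 4L), col2 = c("cb", "xy", "yc", "zz"))
dt_na <- data.table(col = rep(NA_character_, 4L), col2 = c("cb", "xy", "yc", "zz"))

# ifelse() with a scalar test returns one element, not the full vector.
length(ifelse(FALSE, TRUE, c(TRUE, FALSE, TRUE)))  # 1

# Plain if/else inside j: stri_detect_fixed() is never called when col[1] is NA,
# and the scalar FALSE is recycled over all rows by `:=`.
dt_ok[, matching_letter := if (is.na(col[1L])) FALSE else stringi::stri_detect_fixed(col2, col)]
dt_na[, matching_letter := if (is.na(col[1L])) FALSE else stringi::stri_detect_fixed(col2, col)]

# Hypothetical helper for several column combinations.
# `pairs` is a list of c(pattern_column, text_column) name pairs (an assumption).
set_match_cols <- function(x, pairs) {
  for (p in pairs) {
    pat <- x[[p[1L]]]
    v <- if (is.na(pat[1L])) FALSE else stringi::stri_detect_fixed(x[[p[2L]]], pat)
    # set() adds the column by reference; a length-1 value is recycled.
    set(x, j = paste0("match_", p[1L], "_", p[2L]), value = v)
  }
  invisible(x)
}

set_match_cols(dt_ok, list(c("col", "col2")))
set_match_cols(dt_na, list(c("col", "col2")))
```

Like the function above, this only inspects the first row of the pattern column, so it assumes that column is either entirely missing or entirely filled.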