
Use an ifelse statement in an R data.table column calculation, dependent on the value in the first row

I have to run some regexes on large data.tables (30m+ rows each, and there are many of them). One column is either a single repeated string (the same for every row) or missing, and the other columns hold a different string per row. If the first column's value is missing or passes some other regex, I do not want to run the regex at all and just return FALSE; if it is not missing, I want to check whether the columns match. Because I need this for thousands of data.tables and the regex takes a couple of seconds, I would like to include an ifelse statement, so that the regex is not even attempted when the condition is FALSE.

This is what I attempted, but none of these work (I also tried fifelse and if_else):

library(data.table)
set.seed(10)
data_table_test <-
  data.table(col  = rep("c", 1e6),
             col2 =  paste(
               sample(letters, 1e6,
                      replace = T),
               sample(letters, 1e6,
                      replace = T),
               sep = ""
             ))

data_table_test2 <-
  data.table(col  = rep(NA, 1e6),
             col2 =  paste(
               sample(letters, 1e6,
                      replace = T),
               sample(letters, 1e6,
                      replace = T),
               sep = ""
             ))


data_table_test[, ':='(matching_letter_1   = stringi::stri_detect_fixed(col2, col),
                       matching_letter_2   = ifelse(is.na(data_table_test[1, col ]), F, stringi::stri_detect_fixed(col2, col))),]

data_table_test2[, ':='(matching_letter_1   = stringi::stri_detect_fixed(col2, col),
                       matching_letter_2   = ifelse(is.na(data_table_test2[1, col ]), F, stringi::stri_detect_fixed(col2, col))),]
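A minimal sketch of why the `ifelse` attempts above misbehave, using base R only. `ifelse()` returns a value shaped like its `test` argument, so a length-1 test yields a length-1 result (which `:=` then recycles to every row), whereas a plain `if`/`else` returns the chosen branch in full. The vectors and the names `full`, `via_ifelse` and `via_if` are made up for illustration, and `grepl(..., fixed = TRUE)` is only a stand-in for `stringi::stri_detect_fixed`.

```r
# Stand-ins for the question's columns (illustrative values, not the real data)
col  <- rep("c", 5)
col2 <- c("ab", "cb", "yc", "zz", "ch")

# Vectorised fixed-string match: one logical per row
full <- grepl(col[1L], col2, fixed = TRUE)

# ifelse() shapes its result like `test`, so a length-1 test gives a length-1 result
via_ifelse <- ifelse(is.na(col[1L]), FALSE, full)

# if/else branches once on the scalar and returns the chosen branch unchanged
via_if <- if (is.na(col[1L])) FALSE else full

length(via_ifelse)  # 1
length(via_if)      # 5
```

`fifelse` and `if_else` are stricter about argument lengths than `ifelse`, which is presumably why they fail outright instead of silently truncating.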

This does work, but is slower:

data_table_test2[, ':='(matching_letter_1   = stringi::stri_detect_fixed(col2, col)), ][, ':='(matching_letter_1 = fifelse(is.na(matching_letter_1),  F, matching_letter_1)), ]

EDIT: The expected output should be something like this:

data_table_test[matching_letter_1 == TRUE]

should be the same as

data_table_test[matching_letter_2 == TRUE]

and

data_table_test2[matching_letter_1 == TRUE]

should be the same as (both empty data.tables)

data_table_test2[matching_letter_2 == TRUE]

A slow but functional tidyverse solution would be this:

library(dplyr)  # provides %>%, as_tibble(), rowwise(), mutate() and filter()

data_table_test %>%
  as_tibble() %>%
  rowwise() %>%
  mutate(matching_letter = ifelse(is.na(data_table_test$col[1]), F, stringi::stri_detect_fixed(col2, col))) %>%
  filter(matching_letter)


# A tibble: 75,772 x 3
# Rowwise: 
   col   col2  matching_letter
   <chr> <chr> <lgl>          
 1 c     cb    TRUE           
 2 c     ce    TRUE           
 3 c     yc    TRUE           
 4 c     ch    TRUE           
 5 c     ic    TRUE           
 6 c     gc    TRUE           
 7 c     cg    TRUE           
 8 c     lc    TRUE           
 9 c     ci    TRUE           
10 c     zc    TRUE           
# ... with 75,762 more rows

and


data_table_test2 %>%
  as_tibble() %>%
  rowwise() %>%
  mutate(matching_letter = ifelse(is.na(data_table_test2$col[1]), F, stringi::stri_detect_fixed(col2, col))) %>%
  filter(matching_letter)



# A tibble: 0 x 3
# Rowwise: 
# ... with 3 variables: col <lgl>, col2 <chr>, matching_letter <lgl>

EDIT 2: This code would do the trick, but it is not the solution I need, because I need to test many combinations of columns. I need the if statement inside the data.table operation.

# Branch once on the first value of col, then assign the same column either way
if (is.na(data_table_test[1, col])) {
  data_table_test[, matching_letter := FALSE]
} else {
  data_table_test[, matching_letter := stringi::stri_detect_fixed(col2, col)]
}

I do not have tidyverse to compare the expected output against; please include expected output produced without such heavy dependencies.

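setmatchingletter = function(x) {
  # x must be a non-empty data.table with columns col and col2
  stopifnot(nrow(x)>0L, c("col","col2")%in%names(x))
  # scalar if/else: the matcher only runs when the first col value is not missing
  v = if (is.na(x$col[1L])) FALSE else {
    stringi::stri_detect_fixed(x$col2, x$col)
  }
  # add the column by reference; a length-1 FALSE is recycled to all rows
  set(x, , "matching_letter", v)
}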

setmatchingletter(data_table_test)
data_table_test[matching_letter==TRUE]

setmatchingletter(data_table_test2)
data_table_test2[matching_letter==TRUE]
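The same scalar branch can also be written directly inside `j`, which may be closer to what the question asks for. This is only a sketch: `dt_chr`, `dt_na` and the helper `add_match` are hypothetical names on small made-up tables, not the question's data.

```r
library(data.table)

# Small made-up tables standing in for the question's data
dt_chr <- data.table(col = rep("c", 4), col2 = c("cb", "xy", "yc", "aa"))
dt_na  <- data.table(col = rep(NA_character_, 4), col2 = c("cb", "xy", "yc", "aa"))

# Hypothetical helper: the scalar `if` sits inside `j`, so the matcher
# is only evaluated when the first value of `col` is not missing
add_match <- function(dt) {
  dt[, matching_letter := if (is.na(col[1L])) FALSE else stringi::stri_detect_fixed(col2, col)]
}

add_match(dt_chr)
add_match(dt_na)

dt_chr[matching_letter == TRUE]  # rows whose col2 contains "c"
dt_na[matching_letter == TRUE]   # empty
```

Because `if` is not vectorised, it decides once per table instead of once per row, so the skipped branch costs nothing.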

This solution assumes that stringi::stri_detect_fixed is "vectorized", unlike the way it is used in the question.
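A quick way to check that assumption on toy input (the variable names and strings below are illustrative only):

```r
library(stringi)

# Element-wise: str[i] is searched for pattern[i]
pairwise <- stri_detect_fixed(c("cb", "xy", "yc"), c("c", "x", "a"))

# A length-1 pattern is recycled across str
recycled <- stri_detect_fixed(c("cb", "xy", "yc"), "c")

# A missing pattern yields NA, not FALSE
with_na <- stri_detect_fixed(c("cb", "xy"), NA_character_)

pairwise
recycled
with_na
```

The NA result in the last call is why the slower variant in the question needs a second pass to turn NA into FALSE.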

