
R - data.table fast lookup with regex

A data.table with two columns (3-grams and their counts) has a key set on the ngrams column. Each 3-gram is a single character string of three space-separated words.

library(data.table)

set.seed(20182)

create.ngrams <- function(){
        w1 <- paste(sample(letters[1:5], 3, T), collapse = '')
        w2 <- paste(sample(letters[1:5], 3, T), collapse = '')
        w3 <- paste(sample(letters, 5, T), collapse = '')

        ngram <- paste(c(w1, w2, w3), collapse = " ")
        return(ngram)
}

dt <- data.table(ngrams = replicate(100000, create.ngrams()), N = sample.int(100, 100000, replace=T))
setkey(dt, ngrams)


What I need to derive is: given a 2-gram, how many unique 3-grams appear in the 3-gram table with that 2-gram as the stem? The approach so far is to filter the 3-gram table and get a row count, using a regular expression with data.table's %like% function. Unfortunately, the documentation states that %like% doesn't make use of the table key:
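As a minimal, self-contained sketch of that approach (the `stem_count` helper and the toy table are illustrative, not from the original post):

```r
library(data.table)

# Toy table standing in for the 3-gram table above
dt <- data.table(ngrams = c("ada cab jsfzb", "ada cab rbkqz", "bda cab oyohg"),
                 N = c(33L, 43L, 10L))

# Count 3-grams whose first two words (the stem) match a given 2-gram
stem_count <- function(dt, stem) {
  dt[ngrams %like% paste0("^", stem, " "), .N]
}

stem_count(dt, "ada cab")  # 2
```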

Note: Current implementation does not make use of sorted keys.

This slows the filtering down considerably:

dt[ngrams %like% '^ada cab \\.*']

          ngrams  N
1: ada cab jsfzb 33
2: ada cab rbkqz 43
3: ada cab oyohg 10
4: ada cab dahtd 87
5: ada cab qgmfb  8
6: ada cab ylyfl 13
7: ada cab izeje 83
8: ada cab fukov 12

microbenchmark::microbenchmark(dt[ngrams %like% '^ada cab \\.*'])

Unit: milliseconds
                                expr     min      lq     mean  median       uq     max neval
 dt[ngrams %like% "^ada cab \\\\.*"] 22.4061 23.9792 25.89883 25.0981 26.88145 34.7454   100

On the actual table I'm working with (nrow = 46856038), the performance is too slow for the task at hand:

Unit: seconds
                              expr      min       lq     mean   median       uq      max neval
 t[ngrams %like% "^on the \\\\.*"] 10.48471 10.57198 11.27199 10.77015 10.94827 17.42804   100

Is there anything I could do to improve performance? I tried working with dplyr a bit, but the gains didn't appear to be significant.

Are you able to go with fixed= patterns? If you prepend a space to all ngrams, it gives you a virtual "word boundary", allowing you to use a much faster fixed pattern:

dt[, ngrams1 := paste0(" ", ngrams)]
dt
#                ngrams  N        ngrams1
#      1: dcd aee vxfba 99  dcd aee vxfba
#      2: cad bec alsmv 92  cad bec alsmv
#      3: ebe edd zbogd 90  ebe edd zbogd
#      4: aac ace miexa 26  aac ace miexa
#      5: aea cda ppyii 67  aea cda ppyii
#     ---                                
#  99996: cca bbc xaezc 58  cca bbc xaezc
#  99997: ebc cae ktacb 95  ebc cae ktacb
#  99998: bed abe dpjmc 92  bed abe dpjmc
#  99999: dde cdb frkfz 79  dde cdb frkfz
# 100000: bed bce ydawa 52  bed bce ydawa

dt[ngrams %like% '^ada cab \\.*']
#           ngrams  N        ngrams1
# 1: ada cab qbbiw 22  ada cab qbbiw
# 2: ada cab kpejz 16  ada cab kpejz
# 3: ada cab lighh  4  ada cab lighh
# 4: ada cab rxpmc 64  ada cab rxpmc

dt[grepl(' ada cab ', ngrams1, fixed = TRUE),]
#           ngrams  N        ngrams1
# 1: ada cab qbbiw 22  ada cab qbbiw
# 2: ada cab kpejz 16  ada cab kpejz
# 3: ada cab lighh  4  ada cab lighh
# 4: ada cab rxpmc 64  ada cab rxpmc

In a benchmark, the fixed pattern is 3-4 times as fast:

microbenchmark::microbenchmark(
  a = dt[ngrams %like% '^ada cab \\.*'],
  b = dt[grepl('^ada cab', ngrams),],
  c = dt[ngrams1 %flike% ' ada cab ', ],
  d = dt[grepl(' ada cab ', ngrams1, fixed = TRUE),]
)
# Unit: milliseconds
#  expr       min        lq      mean    median        uq       max neval
#     a 20.299101 21.364401 22.088702 21.832000 22.444351 25.403801   100
#     b 20.605501 21.648101 22.656212 22.382001 23.384151 26.330201   100
#     c  4.337301  4.872151  5.265142  5.125251  5.500951  9.646201   100
#     d  4.301901  4.860501  5.221697  5.102000  5.465402  7.339400   100

This does not work if the word lengths deviate from the 3-3-5 pattern (e.g., if there are more 3-letter words, the fixed pattern might accidentally match words other than the first two).
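To make that caveat concrete, here is a toy example (not from the original answer) where the fixed pattern matches an n-gram whose stem is different:

```r
library(data.table)

# With variable word lengths, " ada cab " can occur past the first
# two words, so the fixed pattern over-matches.
dt <- data.table(ngrams = c("ada cab xxxxx",   # genuine stem match
                            "zz ada cab xx"))  # "ada cab" appears later
dt[, ngrams1 := paste0(" ", ngrams)]

dt[grepl(" ada cab ", ngrams1, fixed = TRUE), ngrams]
# both rows match, although only the first has "ada cab" as its stem
```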

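A further option, not covered in the thread above, would be to exploit the sorted key directly: store the 2-gram stem in its own column, key on it, and look stems up with a binary-search join. This is a sketch under the assumption that the extra column is affordable at 46M rows:

```r
library(data.table)

dt <- data.table(ngrams = c("ada cab jsfzb", "ada cab rbkqz", "bda cab oyohg"),
                 N = c(33L, 43L, 10L))

# Extract the first two words as the stem and key on it
dt[, stem := sub("^(\\S+ \\S+).*", "\\1", ngrams)]
setkey(dt, stem)

# Keyed join: binary search instead of a vector scan over every row
dt[.("ada cab"), .N, nomatch = NULL]  # 2
```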
