I have a data.table with two columns (3-grams and their counts), with the key set on the ngrams column. Each 3-gram is a single character string of three words separated by spaces.
library(data.table)

set.seed(20182)
create.ngrams <- function() {
  # two 3-letter words drawn from a small alphabet, one 5-letter word
  w1 <- paste(sample(letters[1:5], 3, TRUE), collapse = '')
  w2 <- paste(sample(letters[1:5], 3, TRUE), collapse = '')
  w3 <- paste(sample(letters, 5, TRUE), collapse = '')
  paste(c(w1, w2, w3), collapse = " ")
}
dt <- data.table(ngrams = replicate(100000, create.ngrams()),
                 N = sample.int(100, 100000, replace = TRUE))
What I need to derive is: given a 2-gram, how many unique 3-grams appear in the 3-gram table with that 2-gram as the stem? My approach so far has been to filter the 3-gram table with a regular expression using data.table's %like% function and take the row count. Unfortunately, the documentation notes that the current implementation of %like% does not make use of sorted keys, which slows the filtering down considerably:
dt[ngrams %like% '^ada cab \\.*']
ngrams N
1: ada cab jsfzb 33
2: ada cab rbkqz 43
3: ada cab oyohg 10
4: ada cab dahtd 87
5: ada cab qgmfb 8
6: ada cab ylyfl 13
7: ada cab izeje 83
8: ada cab fukov 12
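For reference, the count itself can be taken directly in j of the same subset: .N gives the row count, and uniqueN(ngrams) the number of distinct 3-grams (the two agree here on the assumption that each row is already a distinct 3-gram). A minimal sketch on a tiny hand-made table:

```r
library(data.table)

dt2 <- data.table(ngrams = c("ada cab jsfzb", "ada cab rbkqz", "bda cab xxxxx"),
                  N = c(33L, 43L, 10L))
# row count of 3-grams whose first two words are "ada cab"
dt2[ngrams %like% '^ada cab ', .N]
# number of distinct 3-grams with that stem
dt2[ngrams %like% '^ada cab ', uniqueN(ngrams)]
```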
microbenchmark(dt[ngrams %like% '^ada cab \\.*'])
Unit: milliseconds
expr min lq mean median uq max neval
dt[ngrams %like% "^ada cab \\\\.*"] 22.4061 23.9792 25.89883 25.0981 26.88145 34.7454 100
On the actual table I'm working with (nrow = 46856038), the performance is too slow for the task at hand:
Unit: seconds
expr min lq mean median uq max neval
t[ngrams %like% "^on the \\\\.*"] 10.48471 10.57198 11.27199 10.77015 10.94827 17.42804 100
Anything I could do to improve performance? I tried working with dplyr a bit, but the gains didn't appear to be significant.
Are you able to go with fixed = TRUE patterns? If you prepend a space to all ngrams, it gives you a virtual word boundary, allowing you to use a much faster pattern:
dt[, ngrams1 := paste0(" ", ngrams)]
dt
# ngrams N ngrams1
# 1: dcd aee vxfba 99 dcd aee vxfba
# 2: cad bec alsmv 92 cad bec alsmv
# 3: ebe edd zbogd 90 ebe edd zbogd
# 4: aac ace miexa 26 aac ace miexa
# 5: aea cda ppyii 67 aea cda ppyii
# ---
# 99996: cca bbc xaezc 58 cca bbc xaezc
# 99997: ebc cae ktacb 95 ebc cae ktacb
# 99998: bed abe dpjmc 92 bed abe dpjmc
# 99999: dde cdb frkfz 79 dde cdb frkfz
# 100000: bed bce ydawa 52 bed bce ydawa
dt[ngrams %like% '^ada cab \\.*']
# ngrams N ngrams1
# 1: ada cab qbbiw 22 ada cab qbbiw
# 2: ada cab kpejz 16 ada cab kpejz
# 3: ada cab lighh 4 ada cab lighh
# 4: ada cab rxpmc 64 ada cab rxpmc
dt[grepl(' ada cab ', ngrams1, fixed = TRUE),]
# ngrams N ngrams1
# 1: ada cab qbbiw 22 ada cab qbbiw
# 2: ada cab kpejz 16 ada cab kpejz
# 3: ada cab lighh 4 ada cab lighh
# 4: ada cab rxpmc 64 ada cab rxpmc
In a benchmark, the fixed pattern is roughly four times as fast:
microbenchmark::microbenchmark(
a = dt[ngrams %like% '^ada cab \\.*'],
b = dt[grepl('^ada cab', ngrams),],
c = dt[ngrams1 %flike% ' ada cab ', ],
d = dt[grepl(' ada cab ', ngrams1, fixed = TRUE),]
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# a 20.299101 21.364401 22.088702 21.832000 22.444351 25.403801 100
# b 20.605501 21.648101 22.656212 22.382001 23.384151 26.330201 100
# c 4.337301 4.872151 5.265142 5.125251 5.500951 9.646201 100
# d 4.301901 4.860501 5.221697 5.102000 5.465402 7.339400 100
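Not part of the original answer, but since the table's key keeps coming up: if you run many such stem lookups, another option (a sketch, assuming the stem is always the first two words) is to materialize the 2-gram stem as its own column and key on it, so each lookup becomes a binary search instead of a full-vector scan:

```r
library(data.table)

dt <- data.table(ngrams = c("ada cab jsfzb", "ada cab rbkqz",
                            "bda cab xxxxx", "ada cad yyyyy"),
                 N = c(33L, 43L, 10L, 5L))
# strip the last word to get the 2-gram stem, then key on it
dt[, stem := sub(' [^ ]+$', '', ngrams)]
setkey(dt, stem)
# keyed subset: binary search on the key, no regex scan
dt[.("ada cab")]
# unique 3-gram count for the stem
dt[.("ada cab"), uniqueN(ngrams)]
```

The one-time sort done by setkey is the price; for a 46M-row table queried repeatedly, that cost should amortize quickly.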
One caveat: this does not work if the word lengths deviate from the 3-3-5 structure here (e.g., if later words can also be three letters, the pattern " ada cab " might accidentally match a 2-gram starting at a word other than the first).
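If the word lengths do vary, one workaround (my own sketch, not from the answer) is to prepend a sentinel character that cannot occur inside an ngram, such as "\r", so the fixed pattern is anchored to the start of the string rather than to any word boundary:

```r
library(data.table)

dt <- data.table(ngrams = c("ada cab jsfzb",   # stem at the start: should match
                            "xxx ada cab yy"), # stem mid-string: should not match
                 N = c(33L, 10L))
dt[, ngrams2 := paste0("\r", ngrams)]  # "\r" never appears in the ngrams
# matches only rows where "ada cab" is the leading 2-gram
dt[grepl("\rada cab ", ngrams2, fixed = TRUE)]
```

With the plain space-prefix trick, " ada cab " would also match the second row; the sentinel keeps the match anchored to the first word while preserving the fixed = TRUE speed.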