简体   繁体   中英

R: Memory-Efficient Method of Getting Every Word Permutation within a String

I've a string with a list of words, and I want to get all the possible word combinations from it.

fruits <- "Apple Banana Cherry"

To get this output:

"Apple, Banana, Cherry, Apple Banana, Apple Cherry, Banana Cherry, Apple Banana Cherry"

Using the function defined here , slightly modified:

f1 <- function(str1){
  v1 <- strsplit(str1, ' ')[[1]]
  paste(unlist(sapply(seq(length(v1)), function(i)
    apply(combn(v1, i), 2, paste, collapse=" "))), collapse= ', ')
}

f1(fruits)

This works fine when there's relatively few rows, but the real-life example has a total of 93,300 characters across 3,350 rows with an median string length of 25 characters, causing an error similar to this:

Error in paste(unlist(sapply(seq(length(v1)), function(i) apply(combn(v1, : result would exceed 2^31-1 bytes

I tried changing utils::combn to RcppAlgos::comboGeneral within the function , because it's apparently quicker , but still encountered the same issue. Any suggestions for ways around this?

We have a pretty efficient function for vectorized skipgrams and ngrams in quanteda . Try this one, with multi-threading for efficiency (you can change the threads to your system's maximum):

library("quanteda")
## Package version: 1.4.3
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
## 
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
## 
##     View
quanteda_options(threads = 4)

fruits <- "Apple Banana Cherry"
tokens(fruits) %>%
  tokens_skipgrams(., n = seq_len(ntoken(.)), skip = 0:ntoken(.), concatenator = " ") %>%
  as.character() %>%
  paste(collapse = ", ")
## [1] "Apple, Banana, Cherry, Apple Banana, Apple Cherry, Banana Cherry, Apple Banana Cherry"

If you have three words

fruits <- "Apple Banana Cherry"

the combinations could be expressed by using a 0 or 1 to denote inclusion of each word. This would mean that with three words you have 2^3 - 1 = 7 options, excluding null:

001 Cherry
010 Banana
011 Banana, Cherry
100 Apple
101 Apple, Cherry
110 Apple, Banana
111 Apple, Banana, Cherry

So we can think of this as counting in binary. All three-word combinations can be expressed with three bits, and there are 2^3 - 1 = 7 options.

The problem with storing every combination is that the length of this list will double with every additional word. By the time you have 80 words, it will take 80 bits to express each possible combination, but there will be 2^80 - 1 = around 1,200,000,000,000,000,000,000,000 (1.2E24) different possible combinations, which would take more space than all the hard drives in the world.

I don't mean to imply that this is an unsolvable problem, and it's not my area of experience to judge whether the other answers will do what you want in an efficient way, but I just wanted to observe that there will be physical constraints making it impractical to pre-calculate and store all the possible combinations in the way the question proposes.

To keep it simple with the question, I omitted that what I was ultimately looking to do was to create a list of these combinations.

I also didn't know the name of this was tokenisation with a Skip-Gram . While ultimately still slow, this solution avoids the R memory error and with enough compute power, it does the trick:

library(tokenizers)
unlist(tokenize_skip_ngrams(fruits, n = 3, n_min = 1, k = 3))

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM