
R Looking for faster alternative for sapply()

I have written a function that counts the number of words (unigrams) in a sentence:

library(ngram)
library(stringi)
library(tidyverse)
set.seed(123)

get_unigrams <- function(text) {
  sapply(text, function(text) {
    unigram <- ngram(text, n = 1) %>% get.ngrams() %>% length()
    return(unigram)
  })
}

To do this, I used sapply() to apply my get_unigrams() function to each row of the data set. So far this works:

## example dataset:
df <- sample.int(5, 5, replace = TRUE) %>% 
  map(., ~ stri_rand_strings(.x, 10) %>% paste(collapse = " ")) %>%
  unlist() %>% 
  tibble(text = .)

## applying my function
df %>% mutate(n = get_unigrams(text))

# A tibble: 5 x 2
  text                                 n
  <chr>                            <int>
1 SxSgZ6tF2K xtgdzehXaH 9xtgn1TlDJ     3
2 E8PPM98ESG r2Rn7YC7kt Nf5NHoRoon     3
3 Rkdi0TDNbL 6FfPm6Qzts                2
4 A8eLeJBm5S VbKUxTtubP                2
5 9vI3wi8Yxa PeJJDMz958 gctfjWeomy     3

However, since the get_unigrams() function is applied to each row, this is very time-consuming. I would therefore like to ask whether there is a fast alternative to sapply() that speeds up my get_unigrams() function significantly.

## dataset with 50,000 rows:
df <- sample.int(50, 50000, replace = TRUE) %>% 
  map(., ~ stri_rand_strings(.x, 10) %>% paste(collapse = " ")) %>%
  unlist() %>% 
  tibble(text = .)


system.time({
  df %>% mutate(n = get_unigrams(text))
})

#      User      System verstrichen 
#     21.35        0.11       22.06 

For a data set with 50,000 rows, my function needs 22.06 seconds ("verstrichen" is German for "elapsed"). That is clearly too much for me! Can someone help me increase the speed, maybe with a vectorised function?

The construct within the get_unigrams() function must remain the same:

unigram <- ngram(text, n = 1) %>% get.ngrams() %>% length()    
return(unigram)

I am only referring to the sapply() call. Many thanks in advance!

You can utilize multiple CPU cores by replacing lapply with future_lapply:

library(dplyr)
library(purrr)        # for map() below
library(future.apply)

my_slow_func <- function(x) {
  Sys.sleep(1)
  x + 1
}

data <- head(iris, 3)
data

system.time(
  mutate(data, a = Sepal.Length %>% map(my_slow_func))
)
#   user  system elapsed 
#  0.010   0.001   3.004 

plan(multisession)            # run futures on parallel background sessions
chunks <- split(data, seq(3)) # one chunk per row
system.time(
  data$a <- future_lapply(chunks, function(x) my_slow_func(x$Sepal.Length))
)
#   user  system elapsed 
#  0.064   0.003   1.167 
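
Applied to the question's actual data, a minimal sketch could look like this (the chunk count of 8 is an arbitrary choice, and future.packages is passed so the workers can see ngram and the pipe):

library(future.apply)
plan(multisession)

## split the 50,000 texts into 8 roughly equal chunks (8 is arbitrary)
chunks <- split(df$text, cut(seq_along(df$text), 8, labels = FALSE))

n <- unlist(
  future_lapply(chunks, get_unigrams,
                future.packages = c("ngram", "magrittr")),
  use.names = FALSE
)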

Depending on your needs, you might want to consider alternative packages (even though ngram proclaims to be fast). The fastest alternative here (for n = 1) is to split the strings into words and count the unique ones:

stringi_get_unigrams <- function(text)
  lengths(lapply(stri_split(text, fixed = " "), unique))
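
As a quick illustration of what this does (repeated words are counted once):

stringi_get_unigrams(c("a b a", "x y"))
# [1] 2 2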

system.time(res3 <- stringi_get_unigrams(df$text))
#   user  system elapsed 
#   0.84    0.00    0.86 

If you want something more complex (e.g. n = 2), you need to compare all pairs of adjacent words, which is a bit more involved:

stringi_get_duograms <- function(text) {
  splits <- stri_split(text, fixed = " ")
  comp <- function(x)
    # pair each word with its successor and count the unique pairs
    nrow(unique(matrix(c(x[-1], x[-length(x)]), ncol = 2)))
  res <- sapply(splits, comp)
  res[res == 0] <- NA_integer_  # strings with fewer than two words
  res
}
system.time(res <- stringi_get_duograms(df$text))
#   user  system elapsed 
#   5.94    0.02    5.93 
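
Not part of the original answer, but the same idea can be sketched for arbitrary n with base R's embed(), which builds the sliding windows of n adjacent words:

stringi_get_ngrams <- function(text, n = 2) {
  splits <- stri_split(text, fixed = " ")
  sapply(splits, function(x) {
    if (length(x) < n) return(NA_integer_)  # too few words for an n-gram
    nrow(unique(embed(x, n)))               # count unique windows of n adjacent words
  })
}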

An added benefit here is that it does not crash when a string contains no word pairs to match at all.
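For example, a one-word string simply yields NA, whereas building the same duogram with ngram(text, n = 2) would throw an error (a quick check, assuming the function above):

stringi_get_duograms("oneword")
# [1] NA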

Timings on my CPU:

system.time({
  res <- get_unigrams(df$text)
})
#   user  system elapsed 
#  12.72    0.16   12.94 

An alternative parallel implementation:

get_unigrams_par <- function(text) {
  require(purrr)
  require(ngram)
  sapply(text, function(text)
    ngram(text, n = 1) %>% get.ngrams() %>% length()
  )
}
cl <- parallel::makeCluster(nc <- parallel::detectCores())
print(nc)
# [1] 12
system.time(
  res2 <- unname(unlist(parallel::parLapply(
    cl,
    split(df$text, sort(1:nrow(df) %% nc)),  # one chunk of rows per worker
    get_unigrams_par
  )))
)
#   user  system elapsed 
#   0.20    0.11    2.95 
parallel::stopCluster(cl)

And just to check that all results are identical:

identical(unname(res), res2)
# [1] TRUE
identical(res2, res3)
# [1] TRUE
