R 為 sapply() 尋找更快的替代方案

Question

我寫了一個 function 來計算一個句子中的單詞數（unigrams）：

library(ngram)
library(stringi)
library(tidyverse)
set.seed(123)

get_unigrams <- function(text) {
  sapply(text, function(text){
    unigram<-  ngram(text, n = 1) %>% get.ngrams() %>% length()
    
    return(unigram)
  }
  )
}

為此，我使用了sapply函數，該函數將我的get_unigrams函數應用於數據集中的每一行。
到目前為止，這也有效：

##example dataset:
df<-sample.int(5, 5, replace = T) %>% 
  map(.,  ~ stri_rand_strings(.x, 10) %>% paste(collapse = " ")) %>%
  unlist() %>% 
  tibble(text = .)

##applying my function
df %>% mutate(n=get_unigrams((text)))

# A tibble: 5 x 2
  text                                 n
  <chr>                            <int>
1 SxSgZ6tF2K xtgdzehXaH 9xtgn1TlDJ     3
2 E8PPM98ESG r2Rn7YC7kt Nf5NHoRoon     3
3 Rkdi0TDNbL 6FfPm6Qzts                2
4 A8eLeJBm5S VbKUxTtubP                2
5 9vI3wi8Yxa PeJJDMz958 gctfjWeomy     3

但是，由於get_unigrams函數適用於每一行，因此非常耗時。 因此，我想問一下 sapply -function 是否有一個快速的替代方案可以顯着加快我的sapply get_unigrams 。

##dataset with 50.000 rows:
df<-sample.int(50, 50000, replace = T) %>% 
  map(.,  ~ stri_rand_strings(.x, 10) %>% paste(collapse = " ")) %>%
  unlist() %>% 
  tibble(text = .)


system.time({
  df %>% mutate(n=get_unigrams((text)))
})

#      User      System verstrichen 
#     21.35        0.11       22.06

對於 50,000 行的數據集，我的 function 需要 22.06 秒（“verstrichen”）。 這對我來說顯然太多了！
有人可以幫我提高速度嗎？ 也許使用矢量化 function？

get_unigrams函數中的構造必須保持不變：

unigram <- ngram(text, n = 1) %>% get.ngrams() %>% length()    
return(unigram)

我只指sapply -函數。
提前謝謝了！

Answer 1

您可以通過將lapply替換為lfuture_apply來利用多個 CPU 內核：

library(dplyr)
library(future.apply)

my_slow_func <- function(x) {
  Sys.sleep(1)
  x + 1
}

data <- head(iris, 3)
data

system.time(
  mutate(data, a = Sepal.Length %>% map(my_slow_func))
)
#   user  system elapsed 
#  0.010   0.001   3.004 

plan(multisession)
chunks <- split(data, seq(3))
system.time(
  data$a <- future_lapply(chunks, function(x) my_slow_func(x$Sepal.Length))
)
#   user  system elapsed 
#  0.064   0.003   1.167

Answer 2

根據您可能想要考慮替代軟件包（而 ngram 聲稱很快）。 這里最快的選擇（當 ng = 1 時）是拆分單詞並找到唯一索引。

stringi_get_unigrams <- function(text)
  lengths(lapply(stri_split(text, fixed = " "), unique))

system.time(res3 <- stringi_get_unigrams(df$text))
#   user  system elapsed 
#   0.84    0.00    0.86

如果您想要更復雜（例如 ng,= 1），您需要比較字符串的所有成對組合。 這有點復雜。

stringi_get_duograms <- function(text){
  splits <- stri_split(text, fixed = " ")
  comp <- function(x)
    nrow(unique(matrix(c(x[-1], x[-length(x)]), ncol = 2)))
  res <- sapply(splits, comp)
  res[res == 0] <- NA_integer_
  res
}
system.time(res <- stringi_get_duograms(df$text))
#   user  system elapsed 
#   5.94    0.02    5.93

在這里，當特定單詞的語料庫中沒有匹配的單詞組合時，我們還有一個額外的好處，那就是不會崩潰。

我的 CPU 上的時間

system.time({
  res <- get_unigrams(df$text)
})
#   user  system elapsed 
#  12.72    0.16   12.94

替代並行實現：

get_unigrams_par <- function(text) {
  require(purrr)
  require(ngram)
  sapply(text, function(text)
    ngram(text, n = 1) %>% get.ngrams() %>% length()
  )
}
cl <- parallel::makeCluster(nc <- parallel::detectCores())
print(nc)
# [1] 12
system.time(
res2 <- unname(unlist(parallel::parLapply(cl, 
                                         split(df$text, 
                                               sort(1:nrow(df)%%nc)), 
                                         get_unigrams_par)))
)
#   user  system elapsed 
#   0.20    0.11    2.95 
parallel::stopCluster(cl)

只是為了檢查所有結果是否相同：

identical(unname(res), res2)
# TRUE
identical(res2, res3)
# TRUE

R 為 sapply() 尋找更快的替代方案

問題描述

2 個解決方案

解決方案1
0 2021-12-03 12:54:46

解決方案2
0 2021-12-03 14:39:54

R 為 sapply() 尋找更快的替代方案

問題描述

2 個解決方案

解決方案1 0 2021-12-03 12:54:46

解決方案2 0 2021-12-03 14:39:54

解決方案1
0 2021-12-03 12:54:46

解決方案2
0 2021-12-03 14:39:54