Faster alternative to applying the function utf8ToInt over a matrix

Question

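I have a 9000000x10 character matrix (my_data) in which every value is a string. I want to convert it to a numeric matrix using the function utf8ToInt, but this takes a very long time and crashes my session.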

new_matrix <- apply(my_data, 1:2, "utf8ToInt")

The result is what I expect, but I need a more efficient way to do this.

Any help is greatly appreciated.

Imagine my data is:

my_data <- matrix(c("a","b","c","d"), ncol = 2)

But it is actually 9000000x10 rather than 2x2.

Answer 1

Using vapply is almost twice as fast as apply. Because vapply returns a vector, the matrix shape has to be restored afterwards (here with structure).

library(microbenchmark)

my_data <- matrix(sample(letters, 2*100, replace = TRUE), ncol = 2)

microbenchmark(
  apply  = apply(my_data, 1:2, utf8ToInt),
  vapply = structure(vapply(my_data, utf8ToInt, numeric(1)), dim=dim(my_data)),
  times = 500L, check = 'equal'
)
#> Unit: microseconds
#>    expr     min      lq    mean  median       uq      max neval
#>   apply 199.201 208.001 224.811 213.801 220.1515 1560.400   500
#>  vapply 111.000 115.501 136.343 120.401 124.9505 1525.901   500

Created on 2021-03-06 by the reprex package (v1.0.0)
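
As a small illustration of the reshaping step described above, here is a sketch on the 2x2 example from the question. It departs from the benchmark in two ways that are assumptions of this sketch, not part of the answer: it uses integer(1) as the template (utf8ToInt returns integers, so this avoids storing doubles) and sets USE.NAMES = FALSE so no names vector is built. It also assumes every cell holds exactly one character; otherwise vapply stops with an error.

```r
# Toy input from the question: a 2x2 character matrix.
my_data <- matrix(c("a", "b", "c", "d"), ncol = 2)

# vapply walks the matrix as a plain vector (column-major order)
# and returns an integer vector of code points.
codes <- vapply(my_data, utf8ToInt, integer(1), USE.NAMES = FALSE)

# Restore the original shape; this is what structure(..., dim = ...) does
# in the benchmark above.
dim(codes) <- dim(my_data)

print(codes)
```

On the full 9000000x10 matrix the result still holds 90 million values, so memory use may remain the limiting factor regardless of which function does the conversion.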

Answer 2

stringi::stri_enc_toutf32 may be another option. From ?stri_enc_toutf32:

This function is roughly equivalent to a vectorized call to utf8ToInt(enc2utf8(str))

On a 1e3 * 2 matrix, stri_enc_toutf32 is roughly 10 and 20 times faster than vapply and apply + utf8ToInt, respectively:

library(stringi)
library(microbenchmark)

nr = 1e3
nc = 2

m = matrix(sample(letters, nr*nc, replace = TRUE), nrow = nr, ncol = nc)

microbenchmark(
  f_apply  = apply(m, 1:2, utf8ToInt),
  f_vapply = structure(vapply(m, utf8ToInt, numeric(1)), dim=dim(m)),
  f = matrix(unlist(stri_enc_toutf32(m), use.names = FALSE), nrow = nrow(m)),
  times = 10L, check = "equal")

# Unit: microseconds
#      expr    min     lq    mean  median     uq    max neval
#   f_apply 2283.4 2297.2 2351.17 2325.40 2354.5 2583.6    10
#  f_vapply 1276.1 1298.0 1348.88 1322.00 1353.4 1611.3    10
#         f   87.6   92.3  108.53  105.15  111.0  163.8    10
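
One detail worth spelling out: stri_enc_toutf32 returns a list with one integer vector per string, which is why the benchmark passes the result through unlist before rebuilding the matrix. The sketch below, again on the question's 2x2 example, adds a length check that is not part of the answer. It assumes each cell must map to exactly one code point, since a multi-character or missing value would otherwise silently shift the reshaped result.

```r
library(stringi)

# Toy input from the question: a 2x2 character matrix.
my_data <- matrix(c("a", "b", "c", "d"), ncol = 2)

# One list element per cell, each an integer vector of code points.
cp <- stri_enc_toutf32(my_data)

# Guard (an addition in this sketch): every cell must yield one code point,
# otherwise unlist() would produce the wrong number of values.
stopifnot(all(lengths(cp) == 1L))

# Flatten and restore the matrix shape (column-major, like the input).
new_matrix <- matrix(unlist(cp, use.names = FALSE), nrow = nrow(my_data))

print(new_matrix)
```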

2 answers

Answer 1
0 2021-03-06 12:50:36

Answer 2
0 2022-08-06 00:19:05