
Faster alternative to apply for running utf8ToInt over a matrix

I have a character matrix (my_data) of dimensions 9000000x10 in which each value is a single-character string. I want to convert it to a numeric matrix using utf8ToInt, but this takes a long time and crashes my session.

new_matrix <- apply(my_data, 1:2, "utf8ToInt")

The result is what I expect, but I need a more efficient way of doing that.

Any help is deeply appreciated.

Imagine my data is:

my_data <- matrix(c("a","b","c","d"), ncol = 2)

but it is actually 9000000x10 instead of 2x2.

Using vapply is almost twice as fast as apply in the benchmark below. Since vapply returns a plain vector, the matrix dimensions have to be restored afterwards (here with structure).

library(microbenchmark)

my_data <- matrix(sample(letters, 2*100, replace = TRUE), ncol = 2)

microbenchmark(
  apply  = apply(my_data, 1:2, utf8ToInt),
  vapply = structure(vapply(my_data, utf8ToInt, numeric(1)), dim=dim(my_data)),
  times = 500L, check = 'equal'
)
#> Unit: microseconds
#>    expr     min      lq    mean  median       uq      max neval
#>   apply 199.201 208.001 224.811 213.801 220.1515 1560.400   500
#>  vapply 111.000 115.501 136.343 120.401 124.9505 1525.901   500

Created on 2021-03-06 by the reprex package (v1.0.0)
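
As a small variation on the vapply call above (a sketch, not part of the benchmark): utf8ToInt returns an integer, so integer(1) can be used as the template instead of numeric(1). This keeps the result in integer storage (4 bytes per cell rather than 8), which may matter at 9000000x10. USE.NAMES = FALSE is added on the assumption that element names are not wanted.

```r
my_data <- matrix(c("a", "b", "c", "d"), ncol = 2)

# utf8ToInt() returns an integer, so integer(1) is the matching template;
# USE.NAMES = FALSE skips building a names vector from the input strings
codes <- vapply(my_data, utf8ToInt, integer(1), USE.NAMES = FALSE)

# vapply() returns a plain vector, so restore the matrix shape
dim(codes) <- dim(my_data)

codes
storage.mode(codes)
```

This still calls utf8ToInt once per cell, so it should not be expected to change the timing much; the point is only the smaller result.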

stringi::stri_enc_toutf32 may be an alternative. From ?stri_enc_toutf32:

This function is roughly equivalent to a vectorized call to utf8ToInt(enc2utf8(str))


On a 1e3 x 2 matrix, stri_enc_toutf32 is roughly 13 times faster than vapply + utf8ToInt and roughly 22 times faster than apply + utf8ToInt (comparing median times):

library(stringi)
library(microbenchmark)

nr = 1e3
nc = 2

m = matrix(sample(letters, nr*nc, replace = TRUE), nrow = nr, ncol = nc)

microbenchmark(
  f_apply  = apply(m, 1:2, utf8ToInt),
  f_vapply = structure(vapply(m, utf8ToInt, numeric(1)), dim=dim(m)),
  f = matrix(unlist(stri_enc_toutf32(m), use.names = FALSE), nrow = nrow(m)),
  times = 10L, check = "equal")

# Unit: microseconds
#      expr    min     lq    mean  median     uq    max neval
#   f_apply 2283.4 2297.2 2351.17 2325.40 2354.5 2583.6    10
#  f_vapply 1276.1 1298.0 1348.88 1322.00 1353.4 1611.3    10
#         f   87.6   92.3  108.53  105.15  111.0  163.8    10
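
If adding a stringi dependency is not an option, a base R variation on the same idea (one vectorized call instead of one call per cell) might be worth benchmarking. This is a sketch that is not taken from the answers above, and it relies on the question's premise that every cell holds exactly one character: the cells are pasted into a single string and utf8ToInt is called once on it.

```r
m <- matrix(c("a", "b", "c", "d"), ncol = 2)

# Assumption: every cell is exactly one character (one code point);
# otherwise the result would not line up with the cells of m
stopifnot(all(nchar(m) == 1L))

# paste() flattens the matrix in column-major order into one string,
# and utf8ToInt() then returns one code point per character
codes <- utf8ToInt(paste(m, collapse = ""))

# Restore the matrix shape
dim(codes) <- dim(m)

codes
```

The nchar check makes an extra pass over the data, so it can be dropped once the input is known to be clean.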
