
Fastest way to convert a list of character vectors to numeric in R

In R, what is the fastest way to convert a list of character vectors, each holding a sequence of numbers stored as text, into numeric?

With the following dummy data:

set.seed(2)
N = 1e7
ncol = 10
myT = formatC(matrix(runif(N), ncol = ncol)) # A matrix converted to characters
# Each row is collapsed into a single space-separated string:
myT = apply(myT, 1, function(x) paste(x, collapse=' ') ) 
head(myT)

Producing:

[1] "0.1849 0.855 0.8272 0.5403 0.3891 0.5184 0.7776 0.5533 0.1566 0.01591"  
[2] "0.7024 0.1008 0.9442 0.8582 0.3184 0.9289 0.9957 0.1311 0.2131 0.07355" 
[3] "0.5733 0.5493 0.3915 0.4423 0.8522 0.6042 0.9265 0.006878 0.7052 0.71"   
[... etc ...] 

I could do:

library(stringi) 
# In the actual dataset, the number of spaces between numbers may vary, hence "\\s+"
system.time(newT <- lapply(stri_split_regex(myT, "\\s+", omit_empty = TRUE), as.numeric))
newT <- unlist(newT) # Final goal is to have a single vector of numbers

On my 64-bit Ubuntu system (Intel Core i7 at 2.10 GHz, 16 GB of RAM):

   user  system elapsed 
  3.748   0.008   3.757 

With the real dataset (ncol = 150 and N ~ 1e9), this takes far too long. Is there a better option?

This is twice as fast on my system:

x <- paste(myT, collapse = "\n")
library(data.table)
DT <- fread(x)
newT2 <- c(t(DT))

I would suggest the "iotools" package, specifically the mstrsplit function. With that you would just do:

library(iotools)
newT <- as.vector(t(mstrsplit(myT, sep = " ", ncol = 10, type = "numeric")))

The "iotools" package is available on GitHub.
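The question notes that the real data may have a varying number of spaces between numbers, while the call above passes a single space as the separator. One possible workaround, sketched here in base R only, is to collapse whitespace runs before splitting on a single space. The input vector `x` is a made-up toy example, and whether this step is needed for mstrsplit (or worth its cost at the real data size) is an assumption that has not been benchmarked here.

```r
# Hypothetical toy input with irregular spacing (not the thread's data)
x <- c("0.1  0.2 0.3", " 0.4 0.5   0.6 ")

# Trim leading/trailing whitespace, then collapse each run of whitespace
# into exactly one space
normalized <- gsub("\\s+", " ", trimws(x))

# A fixed single-space split is now safe; flatten first, then convert once
values <- as.numeric(unlist(strsplit(normalized, " ", fixed = TRUE)))
print(values)
```

After normalization, every string uses exactly one space between numbers, so any splitter that expects a single-character separator sees consistent input.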


Timing comparisons:

OPFun <- function(myT) {
  newT <- lapply(stri_split_regex(myT, "\\s+", omit_empty = TRUE), as.numeric)
  unlist(newT)
}

RolandFun <- function(myT) {
  x <- paste(myT, collapse = "\n")
  DT <- fread(x)
  newT2 <- c(t(DT))
  newT2
}

AMFun <- function(myT) {
  as.vector(t(mstrsplit(myT, sep = " ", ncol = 10, type = "numeric")))
}

system.time(OP <- OPFun(myT))
#    user  system elapsed 
#   3.920   0.004   3.917 
system.time(Roland <- RolandFun(myT))
#    user  system elapsed 
#   3.156   0.020   3.175 
system.time(AM <- AMFun(myT))
#    user  system elapsed 
#   0.664   0.016   0.676 

all.equal(OP, Roland)
# [1] TRUE
all.equal(Roland, AM)
# [1] TRUE

mstrsplit(myT, sep = " ", type = "numeric")[, 1] is marginally faster. Note also that the order of operations affects performance: unlist(lapply(x, as.numeric)) is slower than as.numeric(unlist(x)).

set.seed(2)
N = 1e4
ncol = 10
myT = formatC(matrix(runif(N), ncol = ncol)) # A matrix converted to characters
myT = apply(myT, 1, function(x) paste(x, collapse=' ') ) 
head(myT)

library(microbenchmark)
library(stringi) 
library(data.table)
library(iotools)
microbenchmark(
  original = {
    newT <- lapply(stri_split_regex(myT, "\\s+", omit_empty = TRUE), as.numeric)
    unlist(newT)
  },
  data.table = {
    x <- paste(myT, collapse = "\n")
    DT <- fread(x)
    c(t(DT))
  },
  iotools = {
    as.vector(t(mstrsplit(myT, sep = " ", ncol = 10, type = "numeric")))
  },
  strsplit = {
    as.numeric(unlist(strsplit(myT, " ")))
  },
  original2 = {
     as.numeric(unlist(stri_split_regex(myT, "\\s+", omit_empty = TRUE)))
  },
  iotools2 = {
    mstrsplit(myT, sep = " ", type = "numeric")[, 1]
  }
)
Unit: milliseconds
       expr      min       lq     mean   median       uq       max neval   cld
   original 52.03538 53.56949 56.02025 54.27165 55.40487  94.45513   100   c  
 data.table 93.10810 94.63730 98.04845 95.41537 96.51202 212.66666   100     e
    iotools 18.73776 19.44485 21.00974 19.75573 20.05614  42.47620   100 a    
   strsplit 67.04637 69.24053 70.58916 69.86529 70.95980  84.86132   100    d 
  original2 48.25558 49.47346 51.49833 50.14377 50.96139  84.22928   100  b   
   iotools2 18.53165 19.19126 19.72922 19.52567 19.71340  32.48726   100 a    
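The point about ordering (converting per element versus flattening first) can be checked for correctness on a small input using base R alone. This is only a minimal sketch: the vector `x` is a made-up toy example rather than the benchmark data, and it shows that both orderings give the same result, not how large the speed difference is.

```r
# Hypothetical toy input, not the benchmark data from the thread
x <- c("0.5 0.25 0.125", "1 2 3", "10 20 30")
pieces <- strsplit(x, " ", fixed = TRUE)

# Convert each list element separately, then flatten:
# one as.numeric() call per element
per_element <- unlist(lapply(pieces, as.numeric))

# Flatten first, then convert:
# a single vectorised as.numeric() call over all tokens
flatten_first <- as.numeric(unlist(pieces))

print(identical(per_element, flatten_first))
```

Since the two forms return the same vector, the flatten-first form can be swapped in wherever only the final flat numeric vector is needed.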

