Fastest way to convert a list of character vectors to numeric in R
In R, what is the fastest way to convert a list containing suites of character numbers (as character vectors) into numeric?
With the following dummy data:
set.seed(2)
N = 1e7
ncol = 10
myT = formatC(matrix(runif(N), ncol = ncol)) # A matrix converted to characters
# Each row is collapsed into a single suite of characters:
myT = apply(myT, 1, function(x) paste(x, collapse=' ') )
head(myT)
Producing:
[1] "0.1849 0.855 0.8272 0.5403 0.3891 0.5184 0.7776 0.5533 0.1566 0.01591"
[2] "0.7024 0.1008 0.9442 0.8582 0.3184 0.9289 0.9957 0.1311 0.2131 0.07355"
[3] "0.5733 0.5493 0.3915 0.4423 0.8522 0.6042 0.9265 0.006878 0.7052 0.71"
[... etc ...]
I could do:
library(stringi)
# In the actual dataset, the number of spaces between numbers may vary, hence "\\s+"
system.time(newT <- lapply(stri_split_regex(myT, "\\s+", omit_empty=T), as.numeric))
newT <- unlist(newT) # Final goal is to have a single vector of numbers
On my Intel Core i7 2.10GHz machine (64-bit, 16GB RAM, running Ubuntu):
user system elapsed
3.748 0.008 3.757
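For readers who want to check the split-then-convert logic without installing stringi, here is a minimal base-R sketch on a tiny input. The vector `small` and the names `tokens` and `small_num` are made up for illustration; this only shows the logic and is not meant as a faster alternative.

```r
# Tiny stand-in for myT (hypothetical data); note the irregular spacing
small <- c("0.1849  0.855 0.8272", " 0.7024 0.1008   0.9442")

# Split on runs of whitespace; trimws() avoids an empty leading token
# when a string starts with a space
tokens <- strsplit(trimws(small), "\\s+")

# Convert each list element, then flatten into a single numeric vector
small_num <- unlist(lapply(tokens, as.numeric))
print(small_num)
```

The result should be one numeric vector holding the six values in their original order.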
With the real dataset (ncol=150 and N~1e9), this takes way too long. Is there any better option?
This is twice as fast on my system:
x <- paste(myT, collapse = "\n")
library(data.table)
DT <- fread(x)
newT2 <- c(t(DT))
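The transpose in the last step matters because R flattens tables column by column. The sketch below illustrates this on a two-row toy input, using base R's read.table() as a stand-in for fread() so that it runs without data.table. The names `small`, `df`, `col_major` and `row_major` are assumptions made for this example.

```r
# Two "rows" of numbers, as in myT but much smaller (hypothetical data)
small <- c("1 2 3", "4 5 6")

# Same trick as above: join the rows into one newline-separated string
x <- paste(small, collapse = "\n")

# Base-R stand-in for fread(): parse the string as a whitespace-separated table
df <- read.table(text = x)

# Flattening directly walks down the columns first
col_major <- as.vector(as.matrix(df))
# Transposing first restores the original left-to-right, row-by-row order
row_major <- as.vector(t(as.matrix(df)))

print(col_major)  # 1 4 2 5 3 6
print(row_major)  # 1 2 3 4 5 6
```

Without the t(), the numbers would come out interleaved across rows instead of in the order they appear in the input strings.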
I would suggest the "iotools" package, specifically the mstrsplit function. With that you would just do:
library(iotools)
newT <- as.vector(t(mstrsplit(myT, sep = " ", ncol = 10, type = "numeric")))
Get the "iotools" package on GitHub.
Timing comparisons:
OPFun <- function(myT) {
newT <- lapply(stri_split_regex(myT, "\\s+", omit_empty=T), as.numeric)
unlist(newT)
}
RolandFun <- function(myT) {
x <- paste(myT, collapse = "\n")
DT <- fread(x)
newT2 <- c(t(DT))
newT2
}
AMFun <- function(myT) {
as.vector(t(mstrsplit(myT, sep = " ", ncol = 10, type = "numeric")))
}
system.time(OP <- OPFun(myT))
# user system elapsed
# 3.920 0.004 3.917
system.time(Roland <- RolandFun(myT))
# user system elapsed
# 3.156 0.020 3.175
system.time(AM <- AMFun(myT))
# user system elapsed
# 0.664 0.016 0.676
all.equal(OP, Roland)
# [1] TRUE
all.equal(Roland, AM)
# [1] TRUE
mstrsplit(myT, sep = " ", type = "numeric")[, 1] is marginally faster.
Note that the order of operations influences performance: unlist(lapply(x, as.numeric)) is slower than as.numeric(unlist(x)).
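A small base-R sketch of that ordering effect is shown below. The list `x` and the helper names `per_element` and `flatten_first` are hypothetical; the timings depend on the machine, so none are quoted here.

```r
set.seed(2)
# A list of 1e4 character vectors, each holding 10 formatted numbers
# (a hypothetical stand-in for the output of the split step)
x <- replicate(1e4, formatC(runif(10)), simplify = FALSE)

# Convert each list element separately, then flatten
per_element <- function(x) unlist(lapply(x, as.numeric))
# Flatten to one character vector first, then convert in a single call
flatten_first <- function(x) as.numeric(unlist(x))

# Both orders give the same numeric vector
stopifnot(identical(per_element(x), flatten_first(x)))

# Repeat each a few times so the difference is easier to see
system.time(for (i in 1:20) per_element(x))
system.time(for (i in 1:20) flatten_first(x))
```

The second form makes a single vectorized as.numeric() call instead of one call per list element, which is the likely reason it comes out ahead.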
set.seed(2)
N = 1e4
ncol = 10
myT = formatC(matrix(runif(N), ncol = ncol)) # A matrix converted to characters
myT = apply(myT, 1, function(x) paste(x, collapse=' ') )
head(myT)
library(microbenchmark)
library(stringi)
library(data.table)
library(iotools)
microbenchmark(
original = {
newT <- lapply(stri_split_regex(myT, "\\s+", omit_empty=T), as.numeric)
unlist(newT)
},
data.table = {
x <- paste(myT, collapse = "\n")
DT <- fread(x)
c(t(DT))
},
iotools = {
as.vector(t(mstrsplit(myT, sep = " ", ncol = 10, type = "numeric")))
},
strsplit = {
as.numeric(unlist(strsplit(myT, " ")))
},
original2 = {
as.numeric(unlist(stri_split_regex(myT, "\\s+", omit_empty = TRUE)))
},
iotools2 = {
mstrsplit(myT, sep = " ", type = "numeric")[, 1]
}
)
Unit: milliseconds
       expr      min       lq     mean   median       uq       max neval cld
   original 52.03538 53.56949 56.02025 54.27165 55.40487  94.45513   100   c
 data.table 93.10810 94.63730 98.04845 95.41537 96.51202 212.66666   100   e
    iotools 18.73776 19.44485 21.00974 19.75573 20.05614  42.47620   100   a
   strsplit 67.04637 69.24053 70.58916 69.86529 70.95980  84.86132   100   d
  original2 48.25558 49.47346 51.49833 50.14377 50.96139  84.22928   100   b
   iotools2 18.53165 19.19126 19.72922 19.52567 19.71340  32.48726   100   a