The fastest way to convert numeric to character in R
I need to convert a numeric vector into character in R. As far as I know, there are several ways to do this (see below).
It seems the fastest ways are sprintf and gettextf.
set.seed(1)
a <- round(runif(100000), 2)
system.time(b1 <- as.character(a))
user system elapsed
0.108 0.000 0.105
system.time(b2 <- formatC(a))
user system elapsed
0.052 0.000 0.052
system.time(b3 <- sprintf('%.2f', a))
user system elapsed
0.044 0.000 0.046
system.time(b4 <- gettextf('%.2f', a))
user system elapsed
0.048 0.000 0.046
system.time(b5 <- paste0('', a))
user system elapsed
0.124 0.000 0.129
Are there other methods to convert numeric into character in R? Thanks for any suggestions.
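Before comparing speed, it is worth noting that the calls in the question do not all return the same strings: sprintf('%.2f', a) always writes two decimals, while as.character and paste0 drop trailing zeros. A minimal sketch on a small made-up vector (the values in x are illustrative, not from the question):

```r
# Small illustrative input; the values are chosen only to show the difference.
x <- c(0.1, 0.25, 1)

s_char   <- as.character(x)     # "0.1"  "0.25" "1"
s_paste  <- paste0("", x)       # same strings as as.character(x)
s_printf <- sprintf("%.2f", x)  # "0.10" "0.25" "1.00"

print(s_char)
print(s_printf)
```

So a timing comparison is only like for like if either output format is acceptable.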
Actually, it seems like formatC comes out faster:
library(microbenchmark)
a <- round(runif(100000), 2)
microbenchmark(
as.character(a),
formatC(a),
format(a),
sprintf('%.2f', a),
gettextf('%.2f', a),
paste0('', a)
)
Output:
Unit: milliseconds
                expr      min       lq     mean   median       uq       max neval
     as.character(a) 69.58868 70.74803 71.98464 71.41442 72.92168  82.21936   100
          formatC(a) 33.35502 36.29623 38.83611 37.60454 39.27079  72.92176   100
           format(a) 55.98344 56.78744 58.00442 57.64804 58.83614  66.15601   100
  sprintf("%.2f", a) 46.54285 47.40126 48.53067 48.10791 49.12717  65.26819   100
 gettextf("%.2f", a) 46.74888 47.81214 49.23166 48.60025 49.16692  84.90208   100
       paste0("", a) 86.62459 88.67753 90.80720 89.86829 91.33774 125.51421   100
My sessionInfo:
R version 3.1.0 (2014-04-10)
Platform: x86_64-apple-darwin13.1.0 (64-bit)
locale:
[1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] microbenchmark_1.4-2
loaded via a namespace (and not attached):
[1] colorspace_1.2-4 digest_0.6.4 ggplot2_1.0.0 grid_3.1.0 gtable_0.1.2 MASS_7.3-35
[7] munsell_0.4.2 plyr_1.8.1 proto_0.3-10 Rcpp_0.11.3 reshape2_1.4 scales_0.2.4
[13] stringr_0.6.2 tools_3.1.0
Since you've rounded a to finite precision, convert the unique values once and then look the results up:
f0 = formatC
f1 = function(x) { ux = unique(x); formatC(ux)[match(x, ux)] }
This gives identical results
> identical(f0(a), f1(a))
[1] TRUE
and is faster, at least for the sample data set.
> microbenchmark(f0(a), f1(a))
Unit: milliseconds
  expr      min       lq     mean   median       uq      max neval
 f0(a) 46.05171 46.89991 47.33683 47.42225 47.58196 52.43244   100
 f1(a) 10.97090 11.39974 11.48993 11.52598 11.58505 11.90506   100
(though is this efficiency really relevant in R?)
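The same lookup idea can be wrapped so it works with other converters too. This is only a sketch: convert_via_unique is a hypothetical helper name, and it assumes the converter's output for an element does not depend on where that element sits in the vector.

```r
# Hypothetical helper: run the converter on the distinct values only,
# then map the converted strings back onto the original positions.
convert_via_unique <- function(x, f = formatC, ...) {
  ux <- unique(x)
  f(ux, ...)[match(x, ux)]
}

set.seed(1)
a <- round(runif(1000), 2)   # at most 101 distinct values

b_formatC <- convert_via_unique(a)                        # default converter
b_sprintf <- convert_via_unique(a, sprintf, fmt = "%.2f") # extra args pass through

print(identical(b_formatC, formatC(a)))
print(identical(b_sprintf, sprintf("%.2f", a)))
```

The gain depends on how few distinct values there are; for data with mostly unique values the extra unique and match calls are likely to cost more than they save.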
Three other methods I can think of, none of which are as fast as gettextf, are
storage.mode(a) <- "character"
mode(a) <- "character"
as.vector(a, "character")
The last one is basically as.character.default, bypassing method dispatch. Timings for all of these are about the same as paste(a).
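Note that the first two forms modify the object in place, so to compare them with as.character you need to work on copies. A small sketch on a made-up vector (x and the b_* names are illustrative only):

```r
# Illustrative input; a plain numeric vector without attributes.
x <- c(0.12, 0.5, 3)

b_storage <- x
storage.mode(b_storage) <- "character"  # replacement form, changes the copy

b_mode <- x
mode(b_mode) <- "character"             # replacement form, changes the copy

b_vector <- as.vector(x, "character")   # returns a new vector

print(b_storage)
print(identical(b_mode, as.character(x)))
```

For a plain vector like this, all three give the same strings as as.character(x).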
Benchmark:
set.seed(1)
a=round(runif(100000),2)
times=10
options=15
b=microbenchmark(times=1000,
as.character(a),
as.vector(a,"character"),
format(a),
format(a,scientific=F),
sprintf("%.2f",a),
gettextf("%.2f",a),
formatC(a),
formatC(a,2,,"f"),
{a2=a;mode(a2)="character"},
{a2=a;storage.mode(a2)="character"},
{ux=unique(a);formatC(ux)[match(a,ux)]},
sub("(..)$",".\\1",as.integer(100*a)), # this uses a format like .12 instead of 0.12
{c=as.character(as.integer(100*a));nc=nchar(c);paste0(substr(c,1,nc-3),".",substr(c,nc-1,nc))}, # this uses a format like .12 instead of 0.12
as.character(as.integer(100*a)), # this doesn't include a decimal point and this truncates numbers instead of rounding
as.character(round(100*a)) # this doesn't include a decimal point
)
m=aggregate(b$time,list(gsub(" ","",gsub(" ",";",gsub("\\{ ","{",b$expr)))),median)
m=m[order(m[,2]),]
writeLines(paste(sprintf("%.3f",m[,2]/min(m[,2])),gsub(" ","",m[,1])))
This shows the median time of a thousand runs relative to the fastest option:
1.000 as.character(a)
1.322 as.vector(a,"character")
1.715 {a2=a;storage.mode(a2)="character"}
5.669 {a2=a;mode(a2)="character"}
90.517 as.character(as.integer(100*a))
154.234 as.character(round(100*a))
561.901 {ux=unique(a);formatC(ux)[match(a,ux)]}
3161.683 formatC(a)
3438.877 formatC(a,2,,"f")
3566.394 gettextf("%.2f",a)
3571.937 sprintf("%.2f",a)
3991.150 {c=as.character(as.integer(100*a));nc=nchar(c);paste0(substr(c,1,nc-3),".",substr(c,nc-1,nc))}
4746.212 format(a)
4747.499 format(a,scientific=F)
6004.276 sub("(..)$",".\\1",as.integer(100*a))
At first I thought that as.character might be so much faster than formatC only because I ran the benchmark multiple times on the same input, but it is also faster when it is run just once:
> v=rnorm(1e6);t=Sys.time();v2=formatC(v);Sys.time()-t
Time difference of 0.2929869 secs
> v=rnorm(1e6);t=Sys.time();v2=as.character(v);Sys.time()-t
Time difference of 0.0001451969 secs
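The one-off timings above measure only the call itself. As a hedged cross-check, the sketch below also times the call together with a pass over every resulting string, on the assumption (not verified in this thread) that some R versions may postpone part of the conversion until the strings are actually read. The names t_call and t_used are illustrative.

```r
set.seed(1)
v <- rnorm(1e5)

# Time the conversion call on its own.
t_call <- system.time(s1 <- as.character(v))[["elapsed"]]

# Time the conversion plus a pass that reads every resulting string,
# so any work that might have been postponed is counted as well.
t_used <- system.time({
  s2 <- as.character(v)
  total_chars <- sum(nchar(s2))
})[["elapsed"]]

cat("call only:", t_call, "s; call plus use:", t_used, "s\n")
```

If the two numbers differ a lot on your R version, the second is probably the fairer one to compare against formatC or sprintf.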
Note that many methods use scientific notation for numbers whose absolute value is below 1e-3:
> as.vector(.0001,"character")
[1] "1e-04"
> as.character(.0001)
[1] "1e-04"
> n=.0001;storage.mode(n)="character";n
[1] "1e-04"
> n=.0001;mode(n)="character";n
[1] "1e-04"
> formatC(.0001)
[1] "0.0001"
> format(.0001,scientific=F)
[1] "0.0001"
> as.name(.0001)
[1] `1e-04`
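If fixed notation is required for small values, it is safer to ask for it explicitly than to rely on the default of each function. A minimal sketch using only base R (the variable names are illustrative, and the four decimals are an assumption chosen to fit this one value):

```r
x <- 0.0001

fixed_format  <- format(x, scientific = FALSE)         # "0.0001"
fixed_sprintf <- sprintf("%.4f", x)                    # "0.0001"
fixed_formatC <- formatC(x, format = "f", digits = 4)  # "0.0001"

print(c(fixed_format, fixed_sprintf, fixed_formatC))
```

With sprintf and formatC the number of decimals is fixed by the format, so values smaller than that precision are rounded rather than switched to scientific notation.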