The fastest way to convert numeric to character in R

I need to convert a numeric vector to character in R. As far as I know, there are several ways to do this (see below).

The fastest ones seem to be sprintf and gettextf.

set.seed(1)
a <- round(runif(100000), 2)
system.time(b1 <- as.character(a))
   user  system elapsed 
  0.108   0.000   0.105 
system.time(b2 <- formatC(a))
   user  system elapsed 
  0.052   0.000   0.052 
system.time(b3 <- sprintf('%.2f', a))
   user  system elapsed 
  0.044   0.000   0.046 
system.time(b4 <- gettextf('%.2f', a))
   user  system elapsed 
  0.048   0.000   0.046 
system.time(b5 <- paste0('', a))
   user  system elapsed 
  0.124   0.000   0.129 

Are there other methods to convert numeric to character in R? Thanks for any suggestions.

Actually, it seems that formatC comes out faster:

library(microbenchmark)
a <- round(runif(100000), 2)
microbenchmark(
  as.character(a), 
  formatC(a), 
  format(a), 
  sprintf('%.2f', a), 
  gettextf('%.2f', a), 
  paste0('', a)
)

Output:

Unit: milliseconds
                expr      min       lq     mean   median       uq       max neval
     as.character(a) 69.58868 70.74803 71.98464 71.41442 72.92168  82.21936   100
          formatC(a) 33.35502 36.29623 38.83611 37.60454 39.27079  72.92176   100
           format(a) 55.98344 56.78744 58.00442 57.64804 58.83614  66.15601   100
  sprintf("%.2f", a) 46.54285 47.40126 48.53067 48.10791 49.12717  65.26819   100
 gettextf("%.2f", a) 46.74888 47.81214 49.23166 48.60025 49.16692  84.90208   100
       paste0("", a) 86.62459 88.67753 90.80720 89.86829 91.33774 125.51421   100

My sessionInfo:

R version 3.1.0 (2014-04-10)
Platform: x86_64-apple-darwin13.1.0 (64-bit)

locale:
[1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] microbenchmark_1.4-2

loaded via a namespace (and not attached):
 [1] colorspace_1.2-4 digest_0.6.4     ggplot2_1.0.0    grid_3.1.0       gtable_0.1.2     MASS_7.3-35     
 [7] munsell_0.4.2    plyr_1.8.1       proto_0.3-10     Rcpp_0.11.3      reshape2_1.4     scales_0.2.4    
[13] stringr_0.6.2    tools_3.1.0    
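
One caveat when reading these timings: the candidates do not all return the same strings, so they are not strictly interchangeable. A minimal sketch on a made-up two-element vector (the names x and out are just for illustration):

```r
# Compare what each conversion actually returns for the same input.
x <- c(0.5, 0.25)

out <- list(
  as.character = as.character(x),     # "0.5"  "0.25"
  formatC      = formatC(x),          # "0.5"  "0.25"
  paste0       = paste0("", x),       # "0.5"  "0.25"
  sprintf      = sprintf("%.2f", x),  # "0.50" "0.25"
  format       = format(x)            # "0.50" "0.25" (common number of decimals)
)
print(out)
```

So sprintf('%.2f', a) and format(a) pad to two decimals here, while as.character, formatC and paste0 drop trailing zeros.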

Since you've rounded a to finite precision, convert the unique values once and then look the results up:

f0 = formatC
f1 = function(x) { ux = unique(x); formatC(ux)[match(x, ux)] }

This gives identical results

> identical(f0(a), f1(a))
[1] TRUE

and is faster, at least for the sample data set.

> microbenchmark(f0(a), f1(a))
Unit: milliseconds
  expr      min       lq     mean   median       uq      max neval
 f0(a) 46.05171 46.89991 47.33683 47.42225 47.58196 52.43244   100
 f1(a) 10.97090 11.39974 11.48993 11.52598 11.58505 11.90506   100

(though is this efficiency really relevant in R?)
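
The same lookup idea can be wrapped so that it works with any of the formatters from the question. This is only a sketch: convert_unique is a hypothetical helper name, and the speed-up presumably only appears when the vector has few distinct values relative to its length, as the rounded sample data does.

```r
# Hypothetical helper: format each distinct value once, then map the
# formatted strings back onto the original positions with match().
convert_unique <- function(x, f = formatC) {
  ux <- unique(x)
  f(ux)[match(x, ux)]
}

set.seed(1)
a <- round(runif(100000), 2)
n_unique <- length(unique(a))   # small compared with length(a)

fmt2 <- function(v) sprintf("%.2f", v)
res <- convert_unique(a, fmt2)
stopifnot(identical(res, fmt2(a)))
```

For data with mostly distinct values (for example unrounded runif output), the extra unique and match calls would be pure overhead.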

Three other methods I can think of, none of which is as fast as gettextf:

storage.mode(a) <- "character"
mode(a) <- "character"
as.vector(a, "character")

The last one is basically as.character.default, bypassing method dispatch. Timings for all of these are about the same as for paste(a).

Benchmark:

set.seed(1)
a=round(runif(100000),2)

times=10
options=15

b=microbenchmark(times=1000,
  as.character(a),
  as.vector(a,"character"),
  format(a),
  format(a,scientific=F),
  sprintf("%.2f",a),
  gettextf("%.2f",a),
  formatC(a),
  formatC(a,2,,"f"),
  {a2=a;mode(a2)="character"},
  {a2=a;storage.mode(a2)="character"},
  {ux=unique(a);formatC(ux)[match(a,ux)]},
  sub("(..)$",".\\1",as.integer(100*a)), # this uses a format like .12 instead of 0.12
  {c=as.character(as.integer(100*a));nc=nchar(c);paste0(substr(c,1,nc-3),".",substr(c,nc-1,nc))}, # this uses a format like .12 instead of 0.12
  as.character(as.integer(100*a)), # this doesn't include a decimal point and this truncates numbers instead of rounding
  as.character(round(100*a)) # this doesn't include a decimal point
)

m=aggregate(b$time,list(gsub(" ","",gsub("     ",";",gsub("\\{    ","{",b$expr)))),median)
m=m[order(m[,2]),]
writeLines(paste(sprintf("%.3f",m[,2]/min(m[,2])),gsub(" ","",m[,1])))

This shows the median time over a thousand runs, relative to the fastest option:

1.000 as.character(a)
1.322 as.vector(a,"character")
1.715 {a2=a;storage.mode(a2)="character"}
5.669 {a2=a;mode(a2)="character"}
90.517 as.character(as.integer(100*a))
154.234 as.character(round(100*a))
561.901 {ux=unique(a);formatC(ux)[match(a,ux)]}
3161.683 formatC(a)
3438.877 formatC(a,2,,"f")
3566.394 gettextf("%.2f",a)
3571.937 sprintf("%.2f",a)
3991.150 {c=as.character(as.integer(100*a));nc=nchar(c);paste0(substr(c,1,nc-3),".",substr(c,nc-1,nc))}
4746.212 format(a)
4747.499 format(a,scientific=F)
6004.276 sub("(..)$",".\\1",as.integer(100*a))

At first I thought that as.character might only look so much faster than formatC because I ran the benchmark many times on the same input, but it is also faster when it is run just once:

> v=rnorm(1e6);t=Sys.time();v2=formatC(v);Sys.time()-t
Time difference of 0.2929869 secs
> v=rnorm(1e6);t=Sys.time();v2=as.character(v);Sys.time()-t
Time difference of 0.0001451969 secs
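
A possible explanation, which is an assumption on top of this answer rather than something it states: since R 3.5.0 (the ALTREP framework), as.character on a plain numeric vector can, as far as I know, return a deferred result whose strings are only produced when the elements are first read. If so, timing the call alone understates the real cost. A rough sketch that times the call and the first full read separately (variable names are illustrative, and the timings depend on machine and R version):

```r
# Separate the cost of the as.character() call from the cost of
# actually reading every resulting string.
set.seed(1)
v <- rnorm(1e5)

t_call <- system.time(v2 <- as.character(v))[["elapsed"]]

# nchar() has to look at every element, so any deferred conversion
# work would show up in this second timing instead of the first.
t_read <- system.time(n <- nchar(v2))[["elapsed"]]

cat("call:", t_call, "s; first full read:", t_read, "s\n")
```

If the second timing dominates, the very small ratios for as.character in the table above are mostly measuring the deferred call, not the conversion itself.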

Note that many methods use scientific notation for numbers whose absolute value is below 1e-3:

> as.vector(.0001,"character")
[1] "1e-04"
> as.character(.0001)
[1] "1e-04"
> n=.0001;storage.mode(n)="character";n
[1] "1e-04"
> n=.0001;mode(n)="character";n
[1] "1e-04"
> formatC(.0001)
[1] "0.0001"
> format(.0001,scientific=F)
[1] "0.0001"
> as.name(.0001)
[1] `1e-04`
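
If fixed notation is required for small values, a few base R calls avoid the exponent. A minimal sketch (the variable names are illustrative, and the four decimals are chosen to fit this particular value):

```r
x <- 0.0001

s_default <- as.character(x)                       # "1e-04"
s_format  <- format(x, scientific = FALSE)         # "0.0001"
s_sprintf <- sprintf("%.4f", x)                    # "0.0001"
s_formatC <- formatC(x, format = "f", digits = 4)  # "0.0001"

print(c(s_default, s_format, s_sprintf, s_formatC))
```

With sprintf and formatC in "f" format, the number of decimals is fixed up front, so values smaller than that precision are rounded to zero.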

