
Mystery: Why does the as.character() function in a data.table run faster if I add and subtract another variable?

I noticed something very peculiar when converting dates to character for large data sets. As an example, I have created a mock data set as follows:

DT = data.table(x=rep("2007-1-1", 1e9), y = rep(1,1e9))
# convert x to Date by reference; DT[,x] <- ... is not valid data.table assignment
DT[, x := as.Date(x)]

Now, I would like to convert the x column of dates from a date format to character.

DT[,x.character:= as.character(x)] 

This takes a bit of time for large data sets, and I noticed that the conversion time decreases dramatically if I do the following instead:

DT[,x.character:= as.character(x+y-y)]

All I did here was add y and subtract y, so I really am just getting the same results. From a logical standpoint, it seems like I am making the computer do more work. However, is there a reason why this method would result in a faster run than the straight conversion way?

For illustrative purposes, I timed each variant with system.time() on 100,000 rows (1e5) and got these results:

DT = data.table(x=rep(as.Date("2007-1-1"), 1e5), y = rep(1,1e5))

> system.time(DT[,x.character:= as.character(x)]) 
   user  system elapsed 
   1.89    0.12    2.03 

> system.time(DT[,x.character:= as.character(x+y-y)]) 
   user  system elapsed 
  0.635   0.008   0.643 

> system.time(DT[,x.character.sub:= as.character(x+y-y+y-y)]) 
   user  system elapsed 
  0.347   0.004   0.351 

As we can see, the second method takes less time than the first, and, more interestingly, the third method, which adds and subtracts y twice, takes even less. Is there a reason why?

Thank you!

It's faster the second time you call as.character() in the same R session because the resulting strings have already been added to R's global string cache. Adding and subtracting another variable is irrelevant; what matters is the order in which the calls were run.

> library(data.table)
data.table 1.9.3  For help type: help("data.table")
> DT = data.table(x=rep(as.Date("2007-1-1"), 1e5), y = rep(1,1e5))
> system.time(DT[,x.character := as.character(x)]) 
   user  system elapsed 
  0.572   0.012   0.584 
> system.time(DT[,x.character := as.character(x)]) 
   user  system elapsed 
  0.389   0.008   0.397 
> system.time(DT[,x.character := as.character(x)]) 
   user  system elapsed 
  0.332   0.004   0.337 

To further the point, this doesn't even have anything to do with data.table. From another new session:

> x <- rep(as.Date("2007-1-1"), 1e5)
> system.time(as.character(x)) 
   user  system elapsed 
  0.529   0.008   0.537 
> system.time(as.character(x)) 
   user  system elapsed 
  0.312   0.012   0.324 
> system.time(as.character(x)) 
   user  system elapsed 
  0.327   0.008   0.335 
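
One way to check this yourself is to reverse the order of the calls. The following is only a minimal sketch, meant to be run in a fresh R session: time_once, plain and shifted are hypothetical helper names introduced here, not part of the original question. If call order is what matters, the x+y-y variant should no longer look like the fast one when it runs first. Timings vary by machine and R version, so the gap may be smaller than shown above or absent.

```r
# Hypothetical helper: elapsed seconds for a single call of f.
time_once <- function(f) unname(system.time(f())["elapsed"])

n <- 1e5
x <- rep(as.Date("2007-1-1"), n)
y <- rep(1, n)

plain   <- function() as.character(x)          # direct conversion
shifted <- function() as.character(x + y - y)  # the "add and subtract y" variant

# Run the shifted variant FIRST this time, then the plain one twice.
timings <- c(
  shifted_first = time_once(shifted),
  plain_second  = time_once(plain),
  plain_third   = time_once(plain)
)
print(timings)

# Both expressions produce exactly the same strings.
same_result <- identical(plain(), shifted())
print(same_result)
```

A single system.time() call per variant is noisy, so repeating the whole experiment in several fresh sessions gives a more trustworthy picture than any one run.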
