I noticed something peculiar when converting dates to character for large data sets. As an example, I created a mock data set as follows:
DT = data.table(x=rep("2007-1-1", 1e9), y = rep(1,1e9))
DT[, x := as.Date(x)]
Now, I would like to convert the x column of dates from a date format to character.
DT[,x.character:= as.character(x)]
This takes a while for large data sets, and I noticed that the conversion time drops dramatically if I do the following instead:
DT[,x.character:= as.character(x+y-y)]
All I did here was add y and subtract y, so I really am just getting the same results. From a logical standpoint, it seems like I am making the computer do more work. However, is there a reason why this method would result in a faster run than the straight conversion way?
For illustrative purposes, I ran these conversions on 100,000 rows with system.time() and got these results:
DT = data.table(x=rep(as.Date("2007-1-1"), 1e5), y = rep(1,1e5))
> system.time(DT[,x.character:= as.character(x)])
user system elapsed
1.89 0.12 2.03
> system.time(DT[,x.character:= as.character(x+y-y)])
user system elapsed
0.635 0.008 0.643
> system.time(DT[,x.character.sub:= as.character(x+y-y+y-y)])
user system elapsed
0.347 0.004 0.351
As we can see, the second method takes less time, and more interestingly, the third method, which adds and subtracts y twice, takes even less. Is there a reason why?
Thank you!
It's faster the second time you call as.character during the R session because the strings have already been added to R's global string cache. Adding and subtracting another variable is not relevant.
> library(data.table)
data.table 1.9.3 For help type: help("data.table")
> DT = data.table(x=rep(as.Date("2007-1-1"), 1e5), y = rep(1,1e5))
> system.time(DT[,x.character := as.character(x)])
user system elapsed
0.572 0.012 0.584
> system.time(DT[,x.character := as.character(x)])
user system elapsed
0.389 0.008 0.397
> system.time(DT[,x.character := as.character(x)])
user system elapsed
0.332 0.004 0.337
To further the point, this doesn't even have anything to do with data.table. From another new session:
> x <- rep(as.Date("2007-1-1"), 1e5)
> system.time(as.character(x))
user system elapsed
0.529 0.008 0.537
> system.time(as.character(x))
user system elapsed
0.312 0.012 0.324
> system.time(as.character(x))
user system elapsed
0.327 0.008 0.335
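If that explanation holds, the order of the calls, not the arithmetic, should decide which one is slower. Here is a minimal sketch to check this, meant to be run in a fresh R session. It runs the x + y - y variant first and the plain conversion second. The variable names are only illustrative, and the timings will differ by machine.

```r
n <- 1e5
x <- rep(as.Date("2007-1-1"), n)
y <- rep(1, n)

# Run the arithmetic variant FIRST this time, then the plain conversion.
t_arith <- system.time(a <- as.character(x + y - y))
t_plain <- system.time(b <- as.character(x))

# Both expressions produce exactly the same strings.
stopifnot(identical(a, b))

# Compare user, system and elapsed time for the two calls.
print(rbind(arith_first = t_arith, plain_second = t_plain)[, 1:3])
```

If the first call is the slower one here as well, that points to call order rather than the extra arithmetic.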