
Mystery: Why does the as.character() function in a data.table run faster if I add and subtract another variable?

I noticed something very peculiar when converting dates to character for large data sets. As an example, I created a mock data set as follows:

library(data.table)
DT = data.table(x = rep("2007-1-1", 1e9), y = rep(1, 1e9))
DT[, x := as.Date(x)]  # convert the character column to Date by reference

Now, I would like to convert the x column from Date to character.

DT[,x.character:= as.character(x)] 

This takes a while for large data sets, and I noticed that the conversion time decreases dramatically if I do the following instead:

DT[,x.character:= as.character(x+y-y)]

All I did here was add y and subtract y, so I get exactly the same results. Logically, it seems like I am making the computer do more work. Is there a reason why this would run faster than the straight conversion?
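The claim that both expressions give the same result can be checked directly. Below is a minimal sketch in base R on a small vector. The size `n` and the plain vectors `x` and `y` are assumptions for illustration, standing in for the data.table columns:

```r
# Small stand-ins for the DT columns (size chosen for illustration only)
n <- 1000
x <- rep(as.Date("2007-01-01"), n)
y <- rep(1, n)

# Adding and then subtracting y should leave a Date vector with the same values
shifted <- x + y - y
same_class  <- inherits(shifted, "Date")
same_string <- identical(as.character(x), as.character(shifted))
print(c(same_class = same_class, same_string = same_string))
```

If both flags are TRUE, any timing difference cannot come from the two calls producing different output.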

For illustrative purposes, I ran these processes twice with 100,000 rows (1e5) under system.time() and got these results:

DT = data.table(x = rep(as.Date("2007-1-1"), 1e5), y = rep(1, 1e5))

system.time(DT[, x.character := as.character(x)])
   user  system elapsed 
   1.89    0.12    2.03 

system.time(DT[, x.character := as.character(x+y-y)])
   user  system elapsed 
  0.635   0.008   0.643 

system.time(DT[, x.character.sub := as.character(x+y-y+y-y)])
   user  system elapsed 
  0.347   0.004   0.351 

As we can see, the second method takes less time and, more interestingly, the third method, with an extra +y-y, takes even less. Is there a reason why?

Thank you!

It's faster the second time you call as.character during the R session because the character strings have already been added to R's global string cache. Adding and subtracting another variable is not relevant.

> library(data.table)
data.table 1.9.3  For help type: help("data.table")
> DT = data.table(x=rep(as.Date("2007-1-1"), 1e5), y = rep(1,1e5))
> system.time(DT[,x.character := as.character(x)]) 
   user  system elapsed 
  0.572   0.012   0.584 
> system.time(DT[,x.character := as.character(x)]) 
   user  system elapsed 
  0.389   0.008   0.397 
> system.time(DT[,x.character := as.character(x)]) 
   user  system elapsed 
  0.332   0.004   0.337 

To further the point, this doesn't even have anything to do with data.table. From another new session:

> x <- rep(as.Date("2007-1-1"), 1e5)
> system.time(as.character(x)) 
   user  system elapsed 
  0.529   0.008   0.537 
> system.time(as.character(x)) 
   user  system elapsed 
  0.312   0.012   0.324 
> system.time(as.character(x)) 
   user  system elapsed 
  0.327   0.008   0.335 
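
One way to check that call order, rather than the expression itself, drives the timings is to reverse the order from the question in a fresh session. This is a sketch in base R under that assumption. The helper `time_it` is hypothetical (it is not part of R or data.table), and actual timings depend on the machine and R version:

```r
# Run in a fresh R session so that nothing has been converted yet
x <- rep(as.Date("2007-1-1"), 1e5)
y <- rep(1, 1e5)

# Hypothetical helper: elapsed seconds for one evaluation of an expression
time_it <- function(expr) unname(system.time(expr)["elapsed"])

# Reverse the order used in the question: the x + y - y variant goes first
timings <- c(
  shifted_first = time_it(as.character(x + y - y)),
  plain_second  = time_it(as.character(x)),
  plain_third   = time_it(as.character(x))
)
print(timings)
```

If the explanation above holds, the first call should tend to be the slowest whichever expression it uses. On other R versions the gap may be small or absent.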
