R code incredibly slow
Recently I have been working on some R scripts to generate reports. One of the tasks is to check whether a value in a column matches any row of another data frame. If it does, a new column is set to a logical TRUE/FALSE.
More specifically, I need help improving this code chunk:
for (i in 1:length(df1$Id)) {
df1 <- within(df1, newCol <- df1$Id %in% df2$Id)
}
df1$newCol <- as.factor(df1$newCol)
The dataset has about 10k rows, so it does not make sense for it to take 6 minutes (measured with proc.time()) to execute completely, which is what is currently happening. Also, I need to do other types of checks, so I really need to get this right.
What am I doing wrong there that takes so much time?
Thank you for your help!
Your code is already vectorized, so there is no need for the for loop. In this case, you can tell because you don't even use i inside the loop. This means your loop is executing the exact same code for the exact same result 10k times. If you delete the for wrapper around your functional line
df1 <- within(df1, newCol <- df1$Id %in% df2$Id)
you should get a speed-up of roughly 10k times.
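As a rough check of this point, the sketch below runs both versions on made-up data and compares the results. The toy df1 and df2 and their sizes are assumptions for illustration, not the asker's data, and the timings will vary by machine.

```r
# Hypothetical toy data standing in for the asker's df1 and df2
set.seed(1)
df1 <- data.frame(Id = sample(1:2000, 1000))
df2 <- data.frame(Id = sample(1:2000, 500))

# Looped version: the body never uses i, so every pass redoes the same work
looped <- df1
t_loop <- system.time(
  for (i in seq_along(looped$Id)) {
    looped <- within(looped, newCol <- looped$Id %in% df2$Id)
  }
)

# Single vectorized call: %in% already tests every Id at once
single <- df1
t_single <- system.time(
  single <- within(single, newCol <- single$Id %in% df2$Id)
)

# Both approaches produce the same logical column
identical(looped$newCol, single$newCol)

# Elapsed seconds for each version
t_loop["elapsed"]
t_single["elapsed"]
```

The loop performs one full pass per row, so its cost grows with the square of the row count, while the single call does the pass once.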
One other comment: the point of within() is to avoid re-typing a data frame's name inside it. So you're missing the point by using df1$ inside within(), and your data frame name is so short that typing within() is actually longer in this case. Your entire code could be simplified to one line:
df1$newCol = factor(df1$Id %in% df2$Id)
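For comparison, here is a small sketch of how within() is meant to be used next to the direct assignment. The five-row df1 and three-row df2 are made up for illustration, and the factor() wrapper is left out to keep the comparison simple.

```r
# Hypothetical toy data
df1 <- data.frame(Id = c(1, 2, 3, 4, 5))
df2 <- data.frame(Id = c(2, 4, 6))

# within() looks up Id inside df1, so the df1$ prefix is not needed;
# df2 is found in the calling environment
a <- within(df1, newCol <- Id %in% df2$Id)

# Direct assignment with $, as in the one-liner
b <- df1
b$newCol <- b$Id %in% df2$Id

# Same column either way
identical(a$newCol, b$newCol)
a$newCol
```

With a single new column the two forms are interchangeable; within() tends to pay off only when several columns are created from the same data frame in one expression.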
I'm making my last comment from a state of ignorance about your application, so take it with a grain of salt, but a binary variable is almost always nicer to have as a boolean (TRUE/FALSE) or integer (1/0) than as a factor. It does depend on what you're doing with it, but I would leave the factor() off until necessary.
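A short sketch of why a logical column is often easier to work with than a factor. The vectors in_df2 and ids are hypothetical stand-ins for the result of %in% and the Id column.

```r
# Hypothetical result of an %in% check and the matching Ids
in_df2 <- c(TRUE, FALSE, TRUE, TRUE, FALSE)
ids <- c(10, 20, 30, 40, 50)

# A logical vector works directly in arithmetic and subsetting
sum(in_df2)         # number of matches
mean(in_df2)        # proportion of matches
ids[in_df2]         # Ids that matched
as.integer(in_df2)  # 1/0 encoding

# The same values as a factor
f <- factor(in_df2)
levels(f)                       # the labels are character strings
as.integer(f)                   # level codes (1 and 2), not 0/1
as.logical(as.character(f))     # extra conversion needed to get logicals back
```

Converting from logical to factor later is a single factor() call, so deferring the conversion costs nothing.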