R code incredibly slow

Recently I have been working on some R scripts to produce reports. One of the tasks is to check whether a value in a column matches any row of another data frame. If it does, a new column should be set to a logical TRUE, otherwise FALSE.

More specifically, I need help improving this code chunk:

for (i in 1:length(df1$Id)) {
  df1 <- within(df1, newCol <- df1$Id %in% df2$Id)
}
df1$newCol <- as.factor(df1$newCol)

The dataset has about 10k rows, so it does not make sense for it to need 6 minutes (measured with proc.time()) to run to completion, which is what currently happens. I also need to do other kinds of checks, so I really need to get this right.

What am I doing wrong that is eating up so much time?

Thank you for your help!

Your code is already vectorized, so there is no need for the for loop. In this case, you can tell because you don't even use i inside the loop. This means your loop is executing the exact same code for the exact same result 10k times. If you delete the for wrapper around your functional line

df1 <- within(df1, newCol <- df1$Id %in% df2$Id)

you should get roughly a 10k times speed-up.
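If you want to check this yourself, here is a rough sketch that times both versions. The data frames are made-up stand-ins for yours (2,000 random Ids each, so it finishes quickly), and the actual timings will depend on your machine and data.

```r
# Toy stand-ins for the question's data frames (sizes and values are made up).
set.seed(1)
df1 <- data.frame(Id = sample(1:20000, 2000))
df2 <- data.frame(Id = sample(1:20000, 2000))

# Original pattern: the same whole-column assignment repeated once per row.
looped <- system.time(
  for (i in seq_along(df1$Id)) {
    df1 <- within(df1, newCol <- df1$Id %in% df2$Id)
  }
)
loop_result <- df1$newCol

# Same line, run once, starting again from a data frame without newCol.
df1$newCol <- NULL
single <- system.time(
  df1 <- within(df1, newCol <- df1$Id %in% df2$Id)
)

# Both versions produce the same column; only the elapsed time differs.
same_result <- identical(loop_result, df1$newCol)
print(same_result)
print(rbind(looped = looped, single = single))
```

The elapsed time of the looped version should grow with the number of rows squared (n iterations, each doing work on n rows), which is why it gets painful at 10k rows.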

One other comment is that the point of within is to avoid re-typing a data frame's name inside it. So you're missing the point by using df1$ inside within(), and your data frame name is so short that it is longer to type within() in this case. Your entire code could be simplified to one line:

df1$newCol = factor(df1$Id %in% df2$Id)
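To see why no loop is needed, here is a tiny example with invented Ids (not your data). %in% takes the whole left-hand vector and returns one TRUE/FALSE per element in a single call:

```r
# Invented Ids, purely for illustration.
df1 <- data.frame(Id = c(1, 2, 3, 4, 5))
df2 <- data.frame(Id = c(2, 4, 6))

# One call checks every element of df1$Id against all of df2$Id.
df1$newCol <- df1$Id %in% df2$Id

print(df1$newCol)
# [1] FALSE  TRUE FALSE  TRUE FALSE
```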

My last comment I'm making from a state of ignorance about your application, so take it with a grain of salt, but a binary variable is almost always nicer to have as boolean (TRUE/FALSE) or integer (1/0) than as a factor. It does depend on what you're doing with it, but I would leave the factor() off until necessary.
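As a small illustration of that last point (the values are invented, and whether any of this matters depends on what you do downstream): a logical vector can be counted and averaged directly, while the factor version stores level codes 1 and 2 rather than 0 and 1.

```r
# Invented match flags, as %in% would return them.
in_df2 <- c(FALSE, TRUE, FALSE, TRUE, FALSE)

# Logical vectors count and average directly (TRUE counts as 1).
n_matches <- sum(in_df2)     # 2
match_share <- mean(in_df2)  # 0.4

# The factor version has levels "FALSE" and "TRUE", stored as codes 1 and 2.
as_factor <- factor(in_df2)
codes <- as.integer(as_factor)  # 1 2 1 2 1, not 0 1 0 1 0

# Getting the logical back means going through the level labels.
back <- as.logical(as.character(as_factor))

print(n_matches)
print(match_share)
print(codes)
```

Note that sum(as_factor) would stop with an error, since sum is not defined for factors.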
