R code is very slow

Question

Recently I have been working on some R scripts to generate reports. One of the tasks is to check whether a value in a column matches any row of another data frame. If it does, a new column is set to logical TRUE/FALSE.

More specifically, I need help improving this code chunk:

for (i in 1:length(df1$Id)) {
  df1 <- within(df1, newCol <- df1$Id %in% df2$Id)
}
df1$newCol <- as.factor(df1$newCol)

The dataset has about 10k rows, so it makes no sense that this takes 6 minutes to execute completely (measured with proc.time()), which is what is currently happening. I also need to do other kinds of checks, so I really need to get this right.

What am I doing wrong there that is devouring so much time?

Thank you for your help!

Answer 1

Your code is already vectorized, so there is no need for the for loop. In this case you can tell because you don't even use i inside the loop. This means the loop executes the exact same code, producing the exact same result, 10k times. If you delete the for wrapper around your functional line

df1 <- within(df1, newCol <- df1$Id %in% df2$Id)

you should get a ~10k times speed-up.
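A rough way to check this yourself is sketched below. The data frames here are made-up stand-ins (random Ids, 2000 rows to keep the loop short), not your actual data, and the timings will vary by machine, so treat it only as an illustration that both versions give the same column.

```r
# Hypothetical stand-ins for df1 and df2 (random Ids, not the real data)
set.seed(1)
n <- 2000
df1 <- data.frame(Id = sample(1:5000, n, replace = TRUE))
df2 <- data.frame(Id = sample(1:5000, n, replace = TRUE))

# Original pattern: the same whole-column assignment repeated once per row
looped <- df1
t_loop <- system.time(
  for (i in seq_along(looped$Id)) {
    looped <- within(looped, newCol <- looped$Id %in% df2$Id)
  }
)

# Vectorized pattern: a single call handles every row at once
vec <- df1
t_vec <- system.time(vec$newCol <- vec$Id %in% df2$Id)

# Both approaches produce the same logical column
same_result <- identical(looped$newCol, vec$newCol)

print(t_loop["elapsed"])
print(t_vec["elapsed"])
print(same_result)
```

system.time() is used here instead of two proc.time() calls only for brevity; either works for a coarse measurement.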

One other comment: the point of within() is to avoid re-typing the data frame's name inside it. So you're missing the point by using df1$ inside within(), and your data frame name is so short that using within() actually takes more typing in this case. Your entire code could be simplified to one line:

df1$newCol = factor(df1$Id %in% df2$Id)
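To make the behavior concrete, here is a small sketch on toy data (the Id values are invented for illustration). It shows what %in% returns, and what the within() form looks like when the data frame name is not repeated inside it.

```r
# Toy data, invented for illustration
df1 <- data.frame(Id = c(1, 2, 3, 4))
df2 <- data.frame(Id = c(2, 4, 6))

# %in% is vectorized: one TRUE/FALSE per element of df1$Id
df1$newCol <- df1$Id %in% df2$Id
print(df1$newCol)  # FALSE  TRUE FALSE  TRUE

# Equivalent within() form: Id is looked up inside df1,
# so df1$ does not need to be typed again
df1_alt <- data.frame(Id = c(1, 2, 3, 4))
df1_alt <- within(df1_alt, newCol <- Id %in% df2$Id)
print(identical(df1$newCol, df1_alt$newCol))  # TRUE
```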

My last comment is made in ignorance of your application, so take it with a grain of salt: a binary variable is almost always nicer to have as a boolean (TRUE/FALSE) or an integer (1/0) than as a factor. It does depend on what you're doing with it, but I would leave the factor() off until it is necessary.
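A short sketch of why the logical form tends to be more convenient (the values below are invented for illustration): logicals can be counted, averaged, and used directly for subsetting, while a factor stores level codes that are easy to misread as 1/0.

```r
# Invented example values
flag <- c(TRUE, FALSE, TRUE, TRUE)
ids  <- c(10, 20, 30, 40)

n_matches   <- sum(flag)         # number of TRUE values: 3
share       <- mean(flag)        # proportion of TRUE values: 0.75
matched_ids <- ids[flag]         # logical subsetting: 10 30 40
as_int      <- as.integer(flag)  # 1 0 1 1

# The factor version stores level codes, not 1/0:
# levels are "FALSE", "TRUE", so TRUE maps to 2 and FALSE to 1
f <- factor(flag)
codes <- as.integer(f)           # 2 1 2 2

# Converting later, only when a function really needs a factor, is cheap
f_late <- factor(flag)
```

Note that sum() on a factor is an error, so the counting and averaging shortcuts above are lost once the column has been converted.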

Answer 1
9 | accepted | 2016-11-03 23:14:28
