
Reduce computation time

Most of the data sets I have worked with have been of moderate size (mostly under 100k rows), so my code's execution time has usually not been a big problem for me.

But I recently tried to write a function that takes 2 dataframes as arguments (with, say, m and n rows) and returns a new dataframe with m*n rows. I then have to perform some operations on the resulting data set. So even with small values of m and n (say around 1000 each), the resulting dataframe has more than a million rows.

When I try even simple operations on this dataset, the code takes an intolerably long time to run. Specifically, my resulting dataframe has 2 columns with numeric values, and I need to add a new column that compares the values of these columns and categorizes each row as "Greater than", "Less than", or "Tied".

I am using the following code:

df %>% mutate(compare = ifelse(var1 == var2, "Tied",
              ifelse(var1 > var2, "Greater than", "Less than")))

And, as I mentioned before, this takes forever to run. I did some research on this, and apparently operations on a data.table are significantly faster than on a dataframe, so maybe that's one option I can try.

But I have never used data.tables before. So before I plunge into that, I am quite curious to know whether there are any other ways to speed up computation time for large data sets.
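For reference, a minimal sketch of what the data.table version of this comparison might look like, assuming the columns are named var1 and var2 as in the dplyr code above:

```r
library(data.table)

# Small example data standing in for the large result set
dt <- data.table(var1 = c(1, 5, 3), var2 = c(2, 5, 1))

# := adds the column by reference, without copying the whole table;
# fifelse() is data.table's faster, type-stable version of ifelse()
dt[, compare := fifelse(var1 == var2, "Tied",
               fifelse(var1 > var2, "Greater than", "Less than"))]
```

The `:=` assignment modifies the table in place, which avoids the copy that a plain dataframe assignment would make.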

What other options do you think I can try?

Thanks!

For large problems like this I like to parallelize. Since operations on individual rows are atomic, meaning that the outcome of an operation on a particular row is independent of every other row, this is an "embarrassingly parallel" situation.

library(doParallel)
library(foreach)

registerDoParallel() # You could specify the number of cores to use here. See the documentation.

df$compare <- foreach(v1 = df$var1, v2 = df$var2, .combine = 'c') %dopar% {
    # Borrowing from @nicola in the comments because it's a good solution:
    # sign(v1 - v2) is -1, 0, or 1, so adding 2 gives an index of 1, 2, or 3
    c('Less than', 'Tied', 'Greater than')[sign(v1 - v2) + 2]
}
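The indexing trick inside the loop is itself vectorized, so on a single machine the same idea can also be applied to the whole columns at once, with no per-row loop or parallel overhead (a sketch, again assuming columns named var1 and var2):

```r
# Small example data standing in for the large result set
df <- data.frame(var1 = c(1, 5, 3), var2 = c(2, 5, 1))

# sign(var1 - var2) is -1, 0, or 1; adding 2 turns it into an index of 1, 2, or 3
df$compare <- c("Less than", "Tied", "Greater than")[sign(df$var1 - df$var2) + 2]
```

For a simple elementwise comparison like this, the vectorized form is often faster than dispatching millions of tiny tasks to workers, since each task here does so little work.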
