
Reduce computation time

Most of the data sets I have worked with have been of moderate size (mostly under 100k rows), so my code's execution time has usually not been a big problem for me.

But I recently tried to write a function that takes 2 dataframes as arguments (with, say, m and n rows) and returns a new dataframe with m*n rows. I then have to perform some operations on the resulting data set. So even with small values of m and n (say around 1000 each), the resulting dataframe has more than a million rows.

When I try even simple operations on this dataset, the code takes an intolerably long time to run. Specifically, my resulting dataframe has 2 columns with numeric values, and I need to add a new column that compares the values of these columns and categorizes each row as "Greater than", "Less than", or "Tied".

I am using the following code:

df %>% mutate(compare = ifelse(var1 == var2, "Tied",
              ifelse(var1 > var2, "Greater than", "Less than")))

And, as I mentioned before, this takes forever to run. I did some research on this, and apparently operations on a data.table are significantly faster than on a dataframe, so maybe that's one option I can try.

But I have never used data.tables before. So before I plunge into that, I am quite curious to know whether there are any other ways to speed up computation time for large data sets.
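For reference, a minimal sketch of what the data.table version of this comparison might look like, assuming the columns are named var1 and var2 as in the dplyr code above:

```r
library(data.table)

# Small example data standing in for the large result set
dt <- data.table(var1 = c(1, 5, 3), var2 = c(2, 5, 1))

# := adds the column by reference, without copying the whole table;
# fifelse() is data.table's faster, type-stable version of ifelse()
dt[, compare := fifelse(var1 == var2, "Tied",
               fifelse(var1 > var2, "Greater than", "Less than"))]
```

The `:=` assignment modifies the table in place, which avoids the copy that a plain dataframe assignment would make.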

What other options do you think I can try?

Thanks!

For large problems like this I like to parallelize. Since operations on individual rows are atomic, meaning that the outcome of an operation on a particular row is independent of every other row, this is an "embarrassingly parallel" situation.

library(doParallel)
library(foreach)

registerDoParallel() # You could specify the number of cores to use here. See the documentation.

df$compare <- foreach(v1 = df$var1, v2 = df$var2, .combine = 'c') %dopar% {
    # Borrowing from @nicola in the comments because it's a good solution:
    # sign(v1 - v2) is -1, 0, or 1, so adding 2 gives an index of 1, 2, or 3
    c('Less than', 'Tied', 'Greater than')[sign(v1 - v2) + 2]
}
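The indexing trick inside the loop is itself vectorized, so on a single machine the same idea can also be applied to the whole columns at once, with no per-row loop or parallel overhead (a sketch, again assuming columns named var1 and var2):

```r
# Small example data standing in for the large result set
df <- data.frame(var1 = c(1, 5, 3), var2 = c(2, 5, 1))

# sign(var1 - var2) is -1, 0, or 1; adding 2 turns it into an index of 1, 2, or 3
df$compare <- c("Less than", "Tied", "Greater than")[sign(df$var1 - df$var2) + 2]
```

For a simple elementwise comparison like this, the vectorized form is often faster than dispatching millions of tiny tasks to workers, since each task here does so little work.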
