How to split a data frame and run a custom function on it in parallel?

Question

I have a large dataset of about 25 lakh (2.5 million) rows, to which the function status below is applied. It is a flagging procedure. Inside the function the operations are vectorised and apply-family functions are used. c1 to c4 are columns in my data. Even so, the function takes about 5-6 hours to run.

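status <- function(x) {
  # Drop the INPUT records before flagging
  x <- subset(x, x$RECORD_TYPE != "INPUT")
  x$c1 <- as.character(x$c1)
  x$c2 <- as.factor(x$c2)
  x$c3 <- as.factor(x$c3)
  # One row per c4 group; cbind() coerces both columns to character
  data.frame(cbind(
    # "G" if the group has no "BAD" value, otherwise the count of "BAD"
    tapply(x$c2, x$c4,
           function(v) ifelse(!any(v == "BAD"), "G", sum(v == "BAD"))),
    # Number of "NEG" values in c2D per group
    tapply(x$c2D, x$c4,
           function(v) sum(v == "NEG"))
  ))
}

status(mydata)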

Is there any way to speed up the function further? I work on a server with 16 cores, so I believe it can be made faster.

Answer 1

Perhaps a data.table approach would be faster than trying to parallelize your code, but I would need a sample of your data to make sure this answer addresses your question:

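library(data.table)

setDT(mydata)  # convert to a data.table by reference

# Filter out INPUT records, then compute both flags per c4 group.
# as.character() keeps var1 the same type in every group.
mydata[RECORD_TYPE != "INPUT",
       .(var1 = ifelse(!any(c2 == "BAD"), "G", as.character(sum(c2 == "BAD"))),
         var2 = sum(c2D == "NEG")),
       by = c4]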
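
For comparison, the split-and-parallelize route from the question title could look roughly like the sketch below. It uses the parallel package that ships with R. The toy data, the helper name flag_chunk, and the chunk count n_chunks are assumptions made for illustration and are not from the original post. Every c4 group is kept whole inside one chunk, so the per-group counts stay correct.

```r
library(parallel)

# Hypothetical toy data that reuses the column names from the question
mydata <- data.frame(
  RECORD_TYPE = c("INPUT", "OUTPUT", "OUTPUT", "OUTPUT", "OUTPUT", "OUTPUT"),
  c2  = c("BAD", "BAD", "OK", "OK", "BAD", "BAD"),
  c2D = c("NEG", "NEG", "POS", "NEG", "NEG", "POS"),
  c4  = c("a", "a", "a", "b", "c", "c"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: applies the flagging rule to every c4 group in a chunk
flag_chunk <- function(d) {
  bad <- tapply(d$c2 == "BAD", d$c4, sum)   # count of "BAD" per group
  neg <- tapply(d$c2D == "NEG", d$c4, sum)  # count of "NEG" per group
  data.frame(
    c4   = names(bad),
    var1 = as.vector(ifelse(bad == 0, "G", as.character(bad))),
    var2 = as.vector(neg),
    stringsAsFactors = FALSE
  )
}

# Filter once, before splitting
kept <- mydata[mydata$RECORD_TYPE != "INPUT", ]

# Assign each c4 value to one of n_chunks buckets (assumed value; on the
# 16-core server this would be closer to 16)
n_chunks  <- 2L
groups    <- unique(kept$c4)
bucket_of <- setNames(rep_len(seq_len(n_chunks), length(groups)), groups)
chunks    <- split(kept, bucket_of[as.character(kept$c4)])

# mclapply() forks worker processes; forking is not available on Windows,
# where mc.cores has to be 1
n_cores <- if (.Platform$OS.type == "windows") 1L else n_chunks
pieces  <- mclapply(chunks, flag_chunk, mc.cores = n_cores)

# Combine the per-chunk results and restore a stable order
result <- do.call(rbind, pieces)
result <- result[order(result$c4), ]
rownames(result) <- NULL
print(result)
```

Whether this beats the single-threaded data.table version depends on the data. Splitting and copying chunks to the workers has its own cost, so it is worth timing both on a sample before committing to either.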
