Making a check-based for loop more efficient

Question

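I wrote a for loop that runs some checks and returns 0 or 1 depending on the result. On a large data set, however, it takes a very long time to run (I left it running overnight and it was still going in the morning). Any ideas on how to make it more efficient with dplyr or other tools? Thanks.

Here is some test data: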

tdata <- structure(list(cusip = c(2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 
2, 2, 2, 2, 2, 2), fyear = c("1971", "1971", "1971", "1971", 
"1971", "1971", "1971", "1971", "1971", "1971", "1971", "1971", 
"1972", "1972", "1972", "1972", "1972", "1972", "1972", "1972", 
"1972", "1972", "1972", "1972", "1972", "1973", "1973", "1973", 
"1973", "1973", "1973", "1973", "1973", "1973", "1973", "1973", 
"1973", "1974", "1974", "1974", "1974", "1974", "1974", "1974", 
"1974", "1974", "1974", "1974", "1974", "1975", "1975", "1975", 
"1975", "1975", "1975", "1975", "1975", "1975", "1975", "1975"
), datadate = c(19711231L, 19710129L, 19710226L, 19710331L, 19710430L, 
19710528L, 19710630L, 19710730L, 19710831L, 19710930L, 19711029L, 
19711130L, 19721231L, 19720131L, 19720229L, 19720330L, 19720428L, 
19720531L, 19720630L, 19720731L, 19720831L, 19720929L, 19721031L, 
19721130L, 19721229L, 19731231L, 19730131L, 19730228L, 19730330L, 
19730430L, 19730531L, 19730629L, 19730731L, 19730831L, 19730928L, 
19731031L, 19731130L, 19741231L, 19740131L, 19740228L, 19740329L, 
19740430L, 19740531L, 19740628L, 19740731L, 19740830L, 19740930L, 
19741031L, 19741129L, 19751231L, 19750131L, 19750228L, 19750331L, 
19750430L, 19750530L, 19750630L, 19750731L, 19750829L, 19750930L, 
19751031L), month = c("12", "01", "02", "03", "04", "05", "06", 
"07", "08", "09", "10", "11", "12", "01", "02", "03", "04", "05", 
"06", "07", "08", "09", "10", "11", "12", "12", "01", "02", "03", 
"04", "05", "06", "07", "08", "09", "10", "11", "12", "01", "02", 
"03", "04", "05", "06", "07", "08", "09", "10", "11", "12", "01", 
"02", "03", "04", "05", "06", "07", "08", "09", "10")), .Names = c("cusip", 
"fyear", "datadate", "month"), row.names = c(NA, -60L), class = c("tbl_df", 
"tbl", "data.frame"))

The for loop:

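library(dplyr)  # provides filter()

tdata$check <- 0  # default: check not passed
for (i in min(tdata$cusip):max(tdata$cusip)) {
  for (j in min(tdata$fyear):max(tdata$fyear)) {
    # all rows for this cusip from the four preceding fiscal years
    monthcheck <- filter(tdata, cusip == i & (fyear == j-1 | fyear == j-2 | fyear == j-3 | fyear == j-4))
    # pass when the row count reaches 40% of 60 monthly observations (24)
    if ((length(monthcheck$month) / 60) >= 0.4) tdata$check[tdata$cusip == i & tdata$fyear == j] <- 1
  }
}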

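It returns 1 for 1973-1975, because the check passes for those years. Is there a way to make this for loop more efficient, given that it takes so long to run on a large data set?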

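Edit: explanation of the for loop

For each unique ID (cusip) and each year (fyear), filter the data from the previous 4 years, count the number of observations, and check whether that count is at least 40% of 60. If it is, assign 1 to tdata$check for that cusip and fyear.

The purpose is to make sure that each unique ID has at least 24 of the 60 previous monthly observations.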
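To make the rule concrete, here is a minimal base R sketch of the arithmetic for the single cusip in the test data. The per-year counts in n_obs were tallied by hand from tdata$fyear, and the names n_obs, prev4 and check are only illustrative; this is not the original loop, just the same rule applied to yearly counts.

```r
# Observations per fiscal year in the test data (counted from tdata$fyear)
n_obs <- c("1971" = 12, "1972" = 13, "1973" = 12, "1974" = 12, "1975" = 11)
years <- as.integer(names(n_obs))

# For each year, total observations in the four preceding fiscal years
prev4 <- sapply(years, function(y) sum(n_obs[years >= y - 4 & years <= y - 1]))
names(prev4) <- years

# The check passes when those observations reach 40% of 60, i.e. 24
check <- as.numeric(prev4 / 60 >= 0.4)
names(check) <- years
```

Under these counts, 1973 is the first year with at least 24 prior observations (12 + 13 = 25), which matches the loop returning 1 for 1973-1975.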

Answer 1

A solution using grouped sums and lags:

library(dplyr)

tdata %>%
  group_by(cusip, fyear) %>%
  summarise(number = n(), share = n() / 60)  %>% 
  mutate( cum_y = lag(cumsum(share)), 
          cum_y4 = lag(cum_y, 4),
          last4 = ifelse(is.na(cum_y4), cum_y, cum_y - cum_y4),
          check = as.numeric( last4 >= 0.4 )
          ) %>%
  select(cusip, fyear, last4, check)

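Explanation:

Group by cusip and fyear, count the observations, and compute each year's share (out of 60)
cum_y is the lagged cumulative sum of share, i.e. the total share up to the previous year
cum_y4 is cum_y lagged by 4 years
last4 is the difference between cum_y and cum_y4, i.e. the share covered by the previous 4 years
check tests whether last4 is at least 0.4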
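The intermediate columns are easier to follow on a plain vector. The sketch below assumes one cusip with six consecutive years; the first five counts come from the test data, and the sixth year (12 observations) is hypothetical, added only so that cum_y4 is not NA everywhere.

```r
library(dplyr)

# Yearly shares for one cusip; the sixth value is a made-up extra year
share <- c(12, 13, 12, 12, 11, 12) / 60

cum_y  <- lag(cumsum(share))   # total share up to the previous year
cum_y4 <- lag(cum_y, 4)        # the same total, four years earlier
last4  <- ifelse(is.na(cum_y4), cum_y, cum_y - cum_y4)  # previous 4 years only
check  <- as.numeric(last4 >= 0.4)
```

For the sixth year, last4 covers only years two to five (13 + 12 + 12 + 11 = 48 observations), because the first year has dropped out of the window. Note that this relies on every year being present for the cusip; a missing year would shift the lags.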

Update

To bring back the variables from the original data:

... %>%
  left_join(tdata, by = c("cusip", "fyear"))
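Put together, the whole thing might look like the sketch below. It is a variant rather than the exact code above: it compares raw counts against 24 instead of shares against 0.4 (which sidesteps floating point rounding at the boundary), it uses a compact stand-in called tdata_small with the same per-year row counts as the test data, and it assumes dplyr 1.0.0 or later for the .groups argument.

```r
library(dplyr)

# Compact stand-in for the test data: same cusip and per-year row counts
tdata_small <- tibble(
  cusip = 2,
  fyear = rep(c("1971", "1972", "1973", "1974", "1975"),
              times = c(12, 13, 12, 12, 11))
)

checks <- tdata_small %>%
  group_by(cusip, fyear) %>%
  summarise(number = n(), .groups = "drop_last") %>%  # stays grouped by cusip
  mutate(cum_n  = lag(cumsum(number)),                # rows up to previous year
         cum_n4 = lag(cum_n, 4),
         last4  = ifelse(is.na(cum_n4), cum_n, cum_n - cum_n4),
         check  = as.numeric(last4 >= 24)) %>%
  ungroup() %>%
  select(cusip, fyear, last4, check)

# One row per original observation, with the yearly check attached
result <- left_join(tdata_small, checks, by = c("cusip", "fyear"))
```

The first year of each cusip gets NA rather than 0, since there is no earlier data to count; replace it with 0 if that suits the analysis better.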
