R: Loop through data frame, extract subset of multiple variables, then store in an aggregate dataset

I have an aggregate data table with about 60 million rows. Simplified, the data looks like this:

ServiceN  Customer  Product  LValue  EDate       CovBDate    CovEDate
1         1         12       3       2016-08-03  2016-07-07  2017-07-06
2         1         12       19      2016-07-07  2016-07-07  2017-07-06
3         2         23       222     2017-09-09  2016-10-01  2017-09-31
4         2         23       100     2017-10-01  2017-10-01  2018-09-31

For each row, I need to subset the entire dataset to the rows for the same Customer whose entry date (EDate) falls between that row's CovBDate and CovEDate. Then I need to find the sum of LValue for each product (we're only looking at 10 products, so it's not terrible).

As an example, the final dataset would look something like this:

ServiceN  Customer  Product  LValue  EDate       CovBDate    CovEDate    Prod12  Prod23
1         1         12       3       2016-08-03  2016-07-07  2017-07-06  22      0
2         1         12       19      2016-07-07  2016-07-07  2017-07-06  22      0
3         2         23       222     2017-09-09  2016-10-01  2017-09-31  0       222
4         2         23       100     2017-10-01  2017-10-01  2018-09-31  0       100

I don't know where to begin with this problem, but I've started with this (which does not work):

for (i in 1:length(nrow)) {
  tempdata<-dataset[Customer==Customer[i] & EDate>=CovBDate[i] & 
  EDate<=CovEDate[i]] #data.table subsetting
  tempdata$Prod12<- with(tempdata, sum(LValue[Product== "12"], na.rm=T))
  #I could make this a function, but I want to get this for loop automated first...
  tempdata$Prod23<- with(tempdata, sum(LValue[Product=="23"], na.rm=T))
}

My questions, therefore, are:
1) How do I make this for loop work with so many variables?
2) How do I make the new variable get added to the original dataset (called dataset)?

Using dplyr you could do something like this:

library(dplyr)

dataset <- data.frame(ServiceN = c("1", "2", "3", "4"),
    Customer = c("1", "1", "2", "2"),
    Product = c("12", "12", "23", "23"),
    LValue = c(3, 19, 222, 100),
    EDate  = c("2016-08-03", "2016-07-07", "2017-09-09", "2017-10-01"),
    CovBDate = c("2016-07-07", "2016-07-07", "2016-10-01", "2017-10-01"),
    CovEDate = c("2017-07-06", "2017-07-06", "2017-09-31", "2018-09-31"),
    stringsAsFactors = FALSE)

## Group by customer and product so summary results are per-customer/product combination
dataset %>% group_by(Customer, Product) %>%
    ## Filter based on dates
    filter(EDate >= CovBDate & EDate <= CovEDate) %>%
    ## Sum the LValue based on the defined groupings
    summarise(Sum = sum(LValue))


## A tibble: 2 x 3
## Groups:   Customer [?]
# Customer Product   Sum
#<chr>    <chr>   <dbl>
#1 1        12         22
#2 2        23        322
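
Note that the pipeline above returns one row per Customer/Product combination (322 for customer 2), whereas the expected output in the question keeps every original row and sums only the entries that fall inside that row's own coverage window (222 and 100). A minimal base R sketch of that row-wise version follows. The helper name add_window_sums is hypothetical, and the invalid dates 2017-09-31 and 2018-09-31 from the question are assumed to mean 09-30 so that as.Date() can parse them. It is meant to show the logic on the toy data, not to be tuned for 60 million rows.

```r
# Toy data from the question. Assumption: the invalid dates 2017-09-31 and
# 2018-09-31 are replaced by 09-30 so that they parse as real dates.
dataset <- data.frame(
  ServiceN = 1:4,
  Customer = c(1, 1, 2, 2),
  Product  = c(12, 12, 23, 23),
  LValue   = c(3, 19, 222, 100),
  EDate    = as.Date(c("2016-08-03", "2016-07-07", "2017-09-09", "2017-10-01")),
  CovBDate = as.Date(c("2016-07-07", "2016-07-07", "2016-10-01", "2017-10-01")),
  CovEDate = as.Date(c("2017-07-06", "2017-07-06", "2017-09-30", "2018-09-30"))
)

# Hypothetical helper: for every row, sum LValue per product over all rows of
# the same Customer whose EDate lies inside that row's coverage window, and
# store the sums in new columns Prod<product> on the original data frame.
add_window_sums <- function(df, products = sort(unique(df$Product))) {
  # Start every product column at 0 so products with no entries stay 0.
  for (p in products) df[[paste0("Prod", p)]] <- 0

  # Only rows of the same customer can match, so work one customer at a time.
  rows_by_customer <- split(seq_len(nrow(df)), df$Customer)

  for (rows in rows_by_customer) {
    for (i in rows) {
      in_window <- df$EDate[rows] >= df$CovBDate[i] &
        df$EDate[rows] <= df$CovEDate[i]
      hits <- rows[which(in_window)]  # which() drops NA comparisons
      for (p in products) {
        df[i, paste0("Prod", p)] <-
          sum(df$LValue[hits][df$Product[hits] == p], na.rm = TRUE)
      }
    }
  }
  df
}

result <- add_window_sums(dataset)
print(result)
```

Because the sums are written back into columns of the data frame that was passed in, this also covers the second question (adding the new variables to the original dataset). The nested loop is quadratic within each customer, so for data at the scale described in the question a non-equi join in data.table would probably be a better fit.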
