R sum by group if date within date range

Suppose I have two data frames.

The first one contains the "Date" on which a "Name" issues a "Rec" for an "ID", and the "Stop.Date" on which that "Rec" becomes invalid.

df (only a part)

structure(list(Date = structure(c(13236, 13363, 14074, 13199, 
14554), class = "Date"), ID = c("AU0000XINAA9", "AU0000XINAA9", 
"AU0000XINAC5", "AU0000XINAI2", "AU0000XINAJ0"), Name = c("N+1 BREWIN", 
"N+1 BREWIN", "ARBUTHNOT SECURITIES LTD.", "INVESTEC BANK (UK) PLC", 
"AWRAQ INVESTMENTS"), Rec = c(1, 2, 2, 2, 1), Stop.Date = structure(c(13363, 
13509, 14937, 13230, 16702), class = "Date")), .Names = c("Date", 
"ID", "Name", "Rec", "Stop.Date"), class = c("data.table", "data.frame"
), row.names = c(NA, -5L))

The second data frame only contains a time series, in this case from 2006-02-20 until the end of 2006.

df2

      Date1
  1: 2006-02-20
  2: 2006-02-21
  3: 2006-02-22
  4: 2006-02-23
  5: 2006-02-24
 ---           
311: 2006-12-27
312: 2006-12-28
313: 2006-12-29
314: 2006-12-30
315: 2006-12-31

Now I want my code to sum all "Rec" grouped by ID and Name if the "Date1" variable in df2 falls within the time range (Date until Stop.Date).

I found the post "R - If date falls within range, then sum", which seems very close to my problem, but its solution does not consider any groups.

I want to end up with a data.frame which shows, for each date in df2, the sum of "Rec" for each single "ID".

Expected output, e.g.

        Date1         ID          SumRec 

    1 2006-02-20 AU0000XINAI2        2
    2 2006-02-21 AU0000XINAI2        2
...
    4 2006-03-29 AU0000XINAA9        1
    5 2006-03-30 AU0000XINAA9        1
    6 2006-08-03 AU0000XINAA9        2  # since Date1 2006-08-03 is at the end 
                                          of range in df (row#1)-> it falls 
                                          within range in df (row#2) 
...

Please keep in mind that this is only a small part of the data. Usually there are many more Recs for each "ID" from different "Names" (which is why the sum function makes sense).

Many thanks for your help in advance.

UPDATED VERSION

New data frames:

df

structure(list(Date = structure(c(9905, 10381, 10381, 10954, 
10584, 10632, 10778, 10520, 10631, 10905), class = "Date"), ID = c("BMG4593F1389", 
"BMG4593F1389", "BMG4593F1389", "BMG4593F1389", "BMG4593F1389", 
"BMG4593F1389", "BMG4593F1389", "BMG526551004", "BMG526551004", 
"BMG526551004"), Name = c("ING FM", "Permission Denied 128064", 
"Permission Denied 2880", "Permission Denied 2880", "Permission Denied 32", 
"Permission Denied 888", "Permission Denied 888", "Permission Denied 2880", 
"Permission Denied 2880", "Permission Denied 2880"), Rec = c(2, 
3, 2, 2, 3, 3, 3, 1, 3, 3), Stop.Date = structure(c(12095, 11232, 
10954, 11180, 11345, 10764, 11667, 10631, 10905, 11087), class = "Date")), .Names = c("Date", 
"ID", "Name", "Rec", "Stop.Date"), class = c("data.table", "data.frame"
), row.names = c(NA, -10L))

df2

structure(list(Date1 = structure(c(10954, 10955, 10956, 10957, 
10958, 10959), class = "Date")), .Names = "Date1", row.names = c(NA, 
-6L), class = c("data.table", "data.frame"))

If I now execute the following code:

library(data.table)
library(lubridate)  # provides interval() and %within%

# add the validity interval of each Rec as a new column
df[, interval := interval(Date, Stop.Date)]

# for each date in df2, look up the matching rows of df
df1 <- do.call(rbind, lapply(df2$Date1, function(x) {
  index <- x %within% df$interval
  list(ID = ifelse(any(index), df$ID[index], NA),
       Rec = ifelse(any(index), df$Rec[index], NA),
       Name = ifelse(any(index), df$Name[index], NA),
       interval = ifelse(any(index), df$interval[index], NA))
}))

df3 <- cbind(df2, df1)

I get the following result:

     Date1        ID        Rec  Name interval
1: 1999-12-29 BMG4593F1389   2 ING FM 189216000
2: 1999-12-30 BMG4593F1389   2 ING FM 189216000
3: 1999-12-31 BMG4593F1389   2 ING FM 189216000
4: 2000-01-01 BMG4593F1389   2 ING FM 189216000
5: 2000-01-02 BMG4593F1389   2 ING FM 189216000
6: 2000-01-03 BMG4593F1389   2 ING FM 189216000

But df2$Date1 "1999-12-29", for example, falls within the date ranges of several entries in df (issued under different df$Name values, mostly for df$ID "BMG4593F1389"), so for this particular Date1 the result should contain one row per matching entry:

Expected result for Date1 1999-12-29 (the df3$interval variable is omitted here for simplicity)

         Date1        ID        Rec         Name 
    1: 1999-12-29 BMG4593F1389   2   ING FM 
    2: 1999-12-29 BMG4593F1389   3   Permission Denied 128064 
    3: 1999-12-29 BMG4593F1389   2   Permission Denied 2880
    4: 1999-12-29 BMG4593F1389   3   Permission Denied 32
    5: 1999-12-29 BMG4593F1389   3   Permission Denied 888

    6: 1999-12-29 BMG526551004   3   Permission Denied 2880

    7: 1999-12-30 BMG4593F1389   2   ING FM
... etc

So in the end I need the dates in df2$Date1 to be replicated whenever more than one Name has issued a Rec for a specific df$ID whose date range contains that date.

Can somebody help me with that?

If I understand the updated version of the question correctly, this can be solved using a non-equi join and subsequent aggregation:

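library(data.table)
# non-equi join: keep the rows of df whose range [Date, Stop.Date) contains Date1
df[df2, on = .(Date <= Date1, Stop.Date > Date1), allow.cartesian = TRUE][
  # aggregation (after the join, column Date holds the values of Date1)
  , .(sumRec = sum(Rec)), by = .(Date, ID, Name)]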
          Date           ID                     Name sumRec
 1: 1999-12-29 BMG4593F1389                   ING FM      2
 2: 1999-12-29 BMG4593F1389 Permission Denied 128064      3
 3: 1999-12-29 BMG4593F1389   Permission Denied 2880      2
 4: 1999-12-29 BMG4593F1389     Permission Denied 32      3
 5: 1999-12-29 BMG4593F1389    Permission Denied 888      3
 6: 1999-12-29 BMG526551004   Permission Denied 2880      3
 7: 1999-12-30 BMG4593F1389                   ING FM      2
 8: 1999-12-30 BMG4593F1389 Permission Denied 128064      3
 9: 1999-12-30 BMG4593F1389   Permission Denied 2880      2
10: 1999-12-30 BMG4593F1389     Permission Denied 32      3
11: 1999-12-30 BMG4593F1389    Permission Denied 888      3
12: 1999-12-30 BMG526551004   Permission Denied 2880      3
13: 1999-12-31 BMG4593F1389                   ING FM      2
14: 1999-12-31 BMG4593F1389 Permission Denied 128064      3
15: 1999-12-31 BMG4593F1389   Permission Denied 2880      2
16: 1999-12-31 BMG4593F1389     Permission Denied 32      3
17: 1999-12-31 BMG4593F1389    Permission Denied 888      3
18: 1999-12-31 BMG526551004   Permission Denied 2880      3
19: 2000-01-01 BMG4593F1389                   ING FM      2
20: 2000-01-01 BMG4593F1389 Permission Denied 128064      3
21: 2000-01-01 BMG4593F1389   Permission Denied 2880      2
22: 2000-01-01 BMG4593F1389     Permission Denied 32      3
23: 2000-01-01 BMG4593F1389    Permission Denied 888      3
24: 2000-01-01 BMG526551004   Permission Denied 2880      3
25: 2000-01-02 BMG4593F1389                   ING FM      2
26: 2000-01-02 BMG4593F1389 Permission Denied 128064      3
27: 2000-01-02 BMG4593F1389   Permission Denied 2880      2
28: 2000-01-02 BMG4593F1389     Permission Denied 32      3
29: 2000-01-02 BMG4593F1389    Permission Denied 888      3
30: 2000-01-02 BMG526551004   Permission Denied 2880      3
31: 2000-01-03 BMG4593F1389                   ING FM      2
32: 2000-01-03 BMG4593F1389 Permission Denied 128064      3
33: 2000-01-03 BMG4593F1389   Permission Denied 2880      2
34: 2000-01-03 BMG4593F1389     Permission Denied 32      3
35: 2000-01-03 BMG4593F1389    Permission Denied 888      3
36: 2000-01-03 BMG526551004   Permission Denied 2880      3
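The first version of the question asked for the sum of "Rec" per date and "ID", pooled over all names. A minimal sketch of that variant, assuming the sample data from the updated question (rebuilt here with data.table() so the snippet is self-contained; the result name res and the output column name Date1 are my own choices), simply drops Name from the grouping:

```r
library(data.table)

# Sample data from the updated question, rebuilt from the day counts in the dput output
df <- data.table(
  Date = as.Date(c(9905, 10381, 10381, 10954, 10584, 10632, 10778,
                   10520, 10631, 10905), origin = "1970-01-01"),
  ID = c(rep("BMG4593F1389", 7), rep("BMG526551004", 3)),
  Name = c("ING FM", "Permission Denied 128064", "Permission Denied 2880",
           "Permission Denied 2880", "Permission Denied 32",
           "Permission Denied 888", "Permission Denied 888",
           "Permission Denied 2880", "Permission Denied 2880",
           "Permission Denied 2880"),
  Rec = c(2, 3, 2, 2, 3, 3, 3, 1, 3, 3),
  Stop.Date = as.Date(c(12095, 11232, 10954, 11180, 11345, 10764, 11667,
                        10631, 10905, 11087), origin = "1970-01-01")
)
df2 <- data.table(Date1 = as.Date(10954:10959, origin = "1970-01-01"))

# Same non-equi join as above, but sum Rec per date and ID only.
# After the join, column Date holds the values of Date1, so it is renamed in by.
res <- df[df2, on = .(Date <= Date1, Stop.Date > Date1), allow.cartesian = TRUE][
  , .(sumRec = sum(Rec)), by = .(Date1 = Date, ID)]
print(res)
```

With the sample data, each of the six dates yields one row per ID, where the sum for "BMG4593F1389" pools the five Recs that are valid on that date.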

Please note that I got a strange error message when using df directly as provided by structure(...). The error message went away after calling:

df <- as.data.table(df)

Explanation

I was asked to explain how the non-equi join works. Non-equi joins are an extension of the data.table joins. data.table is a package which enhances base R's data.frame.

Here, we right join df2 with df, i.e., we want to see all rows of df2 together with their matches in df, but only those matches where Date1 (from df2) lies between Date and Stop.Date (from df); Date <= Date1 < Stop.Date, to be exact. As each row of df2 can match many rows of df, we need to use allow.cartesian = TRUE.

There is a video of Arun's talk at the useR! 2016 international R User conference introducing "Efficient in-memory non-equi joins using data.table".
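To make the half-open condition Date <= Date1 < Stop.Date concrete, here is a small sketch on made-up toy data (the tables recs and days and the output columns from and to are hypothetical, not part of the question). It uses the x. and i. prefixes in j to show the original range bounds next to the date they were matched with:

```r
library(data.table)

# Hypothetical toy data: three recommendations with validity ranges
recs <- data.table(
  ID        = c("A", "A", "B"),
  Rec       = c(1, 2, 3),
  Date      = as.Date(c("2006-01-01", "2006-01-03", "2006-01-02")),
  Stop.Date = as.Date(c("2006-01-03", "2006-01-05", "2006-01-04"))
)
days <- data.table(Date1 = as.Date(c("2006-01-02", "2006-01-03")))

# For every row of days, pick the rows of recs with Date <= Date1 < Stop.Date.
# x.Date / x.Stop.Date are the original bounds from recs, i.Date1 comes from days.
joined <- recs[days,
  .(Date1 = i.Date1, ID, Rec, from = x.Date, to = x.Stop.Date),
  on = .(Date <= Date1, Stop.Date > Date1),
  allow.cartesian = TRUE]
print(joined)
```

On 2006-01-03 the first Rec of "A" is no longer matched because its Stop.Date is excluded, while the second Rec of "A", which starts on that day, is matched. This is the same hand-over behaviour the question describes for the end of a range.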
