简体   繁体   中英

R data.table comparing dates between tables and counting records

I have two data tables: a and b

a = structure(list(id = c(86246, 86252, 12262064), brand = c(3718L, 
13474L, 17286L), offerdate = structure(c(15454, 15791, 15883), class = "Date")), .Names = c("id", 
"brand", "offerdate"), row.names = c(NA, -3L), class = c("data.table", 
"data.frame"), .internal.selfref = <pointer: 0x041c24a0>)

b = structure(list(id = c(86246, 86246, 86246), brand = c(3718, 3718, 
875), date = structure(c(15408, 15430, 15434), class = "Date")), .Names = c("id", 
"brand", "date"), row.names = c(NA, -3L), class = c("data.table", 
"data.frame"), .internal.selfref = <pointer: 0x041c24a0>)

> a
         id brand  offerdate
1:    86246  3718 2012-04-24
2:    86252 13474 2013-03-27
3: 12262064 17286 2013-06-27
> b
      id brand       date
1: 86246  3718 2012-03-09
2: 86246  3718 2012-03-31
3: 86246   875 2012-04-04

Now I would like, for each id in a, to count the number of rows in b for the same id and brand, with a date less than 30 days before the a.offerdate.

The outcome I wish to have is an updated a:

> a
         id brand  offerdate  nbTrans_last_30_days
1:    86246  3718 2013-04-24                     1
2:    86252 13474 2013-03-27                     0
3: 12262064 17286 2013-06-27                     0

I can do the job with subset, but I am looking for a fast solution. The subset version would be to do (for each line of a):

subset(b, (id == 86246) & (brand == 3718) & (date > as.Date("2012-03-24")) )

with the date depending on a.offerdate .

I manage to count the total rows in b:

> setkey(a,id, brand)
> setkey(b,id, brand)
> a = a[b[a, .N]]
> setnames(a, "N", "nbTrans")
> a
         id brand  offerdate nbTrans
1:    86246  3718 2012-04-24       2
2:    86252 13474 2013-03-27       0
3: 12262064 17286 2013-06-27       0

but I do not know how to handle the comparison of dates between the two tables.


The answer below works for the original small data set, but somehow did not work for my real data. I tried to reproduce the problem with two new variables: a2 and b2

a2=structure(list(id = c(86246, 86252, 12262064), brand = structure(c(3L, 
+ 9L, 12L), .Label = c("875", "1322", "3718", "4294", "5072", "6732", 
+ "6926", "7668", "13474", "13791", "15889", "17286", "17311", 
+ "26189", "26456", "28840", "64486", "93904", "102504"), class = "factor"), 
+     offerdate = structure(c(15819, 15791, 15883), class = "Date")), .Names = c("id", 
+ "brand", "offerdate"), row.names = c(NA, -3L), class = c("data.table", 
+ "data.frame"))

b2=structure(list(id = c(86246, 86246, 86246, 86246, 86246, 86246, 
+ 86246, 86246), brand = c(3718L, 3718L, 3718L, 3718L, 3718L, 3718L, 
+ 3718L, 3718L), date = structure(c(15423, 15724, 15752, 15767, 
+ 15782, 15786, 15788, 15811), class = "Date")), .Names = c("id", 
+ "brand", "date"), sorted = c("id", "brand"), class = c("data.table", 
+ "data.frame"))

> setkey(a2,id,brand)
> setkey(b2,id,brand)
> merge(a2, b2, all.x = TRUE, allow.cartesian = TRUE)
         id brand  offerdate date
1:    86246  3718 2013-04-24 <NA>
2:    86252 13474 2013-03-27 <NA>
3: 12262064 17286 2013-06-27 <NA>

The problem is that the merge does not keep the b2.date information.

The trick is to use allow.cartesian argument in merge :

setkey(a, id, brand)
setkey(b, id, brand)

c <- merge(a, b, all.x = T, allow.cartesian = T)

c[, Trans := (offerdate - date) <= 30]

c[, list(nbTrans_last_30_days = sum(Trans, na.rm = T)),
  keyby = list(id, brand, offerdate)]

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM