I am struggling days now to find a way to match two different tables (unbalanced in row number) in R with the matching range of dates.
The first data frame represents what someone has declared and the second data.frame represents what actually is.The problem is that an id say A might have kept for a week or more a amount of product in kilos in his/her storage and then appear it one week after.
So i have two data frames with different rows. Both number for rows exceed 500.000.The first has 500.000 and the second 520.000.(so some NAs might occur but this doesn't matter).
I want to seach and match from these 2 different data frames:
and if all these conditions are true to sum the two weights (from the data frames 1 and 2)
The desired output must be a data frame with the matched id, range of date, the product, the man, the 2 columns weights and the last column to be the values that will be the sum of the weights.
the example data frame are the two data frames in R below:
id = rep("A",10)
date = seq(as.Date("2000/1/1"), by = "day", length.out = 10)
product = c("E1","E1","E2","E3","E3","E4","E1","E1","E5","E5")
man = c("PS","PS","PS","DDR","DDR","FFR","PS","PS","BA","BA")
weight = c(1505,300,259,231,140,150,300,112,203,2200)
data1 = data.frame(id,date,product,man,weight);data1
id2 = rep("A",10)
date2 = sample(seq(as.Date("2000/1/10"), by = "day", length.out = 10),10)
product2 = c("E3","E3","E1","E1","E1","E1","E5","E4","E5","E2")
man2 = c("DDR","DDR","PS","PS","PS","PS","BA","FFR","BA","PS")
weight2 = c(259,295,260,240,135,145,250,110,2000,210)
data2 = data.frame(id2,date2,product2,man2,weight2);data2
How can I match them with the ginned date latency condition?
Any help?
not sure if it's what you want but it looks like a rolling join...
Using data.table and lubridate packages, one version could be:
(Here I show two examples one with 30 days forward and one with one month forward)
set.seed(0)
id = rep("A",10)
date = seq(as.Date("2000/1/1"), by = "day", length.out = 10)
product = c("E1","E1","E2","E3","E3","E4","E1","E1","E5","E5")
man = c("PS","PS","PS","DDR","DDR","FFR","PS","PS","BA","BA")
weight = c(1505,300,259,231,140,150,300,112,203,2200)
data1 = data.frame(id,date,product,man,weight);data1
id2 = rep("A",10)
date2 = sample(seq(as.Date("2000/1/10"), by = "day", length.out = 10),10)
product2 = c("E3","E3","E1","E1","E1","E1","E5","E4","E5","E2")
man2 = c("DDR","DDR","PS","PS","PS","PS","BA","FFR","BA","PS")
weight2 = c(259,295,260,240,135,145,250,110,2000,210)
data2 = data.frame(id2,date2,product2,man2,weight2);data2
require(data.table)
require(lubridate)
setDT(data1)
setkey(data1)
setDT(data2)
setkey(data2)
data1[, weight_p50 := weight + 50]
data1[, weight_m50 := weight - 50]
data1[, date := as.Date(date)]
data2[, date2 := as.Date(date2)]
data1[, date_plus_one_month := as.Date(ymd(date %m+% months(1)))]
data1[, date_plus_30_days := as.Date(date + days(30))]
data2[, date2_ := date2]
data1[, date_ := date]
res <-
data1[data2, list(id, product, man, date = date_, date2, weight, weight2, total_weight = weight + weight2),
on = .(id = id2,
product = product2,
man = man2,
date_plus_30_days >= date2,
date <= date2_,
weight_p50 >= weight2,
weight_m50 <= weight2),
allow.cartesian = TRUE, nomatch = 0
][][]
setkey(res, id, product, man, date, date2)
res <- unique(res, by = c("id", "product", "man"))
res[]
which gives:
id product man date date2 weight weight2 total_weight
1: A E1 PS 2000-01-02 2000-01-16 300 260 560
2: A E2 PS 2000-01-03 2000-01-17 259 210 469
3: A E3 DDR 2000-01-04 2000-01-18 231 259 490
4: A E4 FFR 2000-01-06 2000-01-19 150 110 260
5: A E5 BA 2000-01-09 2000-01-12 203 250 453
EDIT: following OP comments, I added unicity on id x man x product such that it keeps only first dates satisfying the constraints.
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.