Joining data by date based on time period condition
(This is a follow-on question to the one originally posted here.)
My original problem has been solved; however, I would also like to merge these two data sets in a slightly different way.
df1 is some sort of "financial report" data and df2 is some sort of "end of year financial data". Previously I wanted to link the financial report data up to the last available financial data.
Now I would like to "forecast" the financial data (df2) using the data in the financial report (df1). That is, link the data by ID, date_f and date.
I would like to impose the following condition:
Join if date (t+1) from df2 > date_f (t) from df1 and the difference is more than 6 months;
otherwise, take date at t+2.
(That is, if date (t+1) is less than 6 months after date_f (t), then use date at (t+2).)
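The rule above can be sketched in base R as "for each report, take the earliest df2 date that lies more than six months after date_f". This is only an illustration on toy data: the names reports, fin and min_gap are made up here, and 183 days is assumed as the day count for "6 months".

```r
# Toy data (hypothetical names; values mirror row 3 of the expected output)
reports <- data.frame(ID = 1002910L, date_f = as.Date("2009-12-22"))
fin <- data.frame(
  ID   = 1002910L,
  date = as.Date(c("2008-12-31", "2009-12-31", "2010-12-31", "2011-12-31")),
  year = 2008:2011
)

min_gap <- 183  # assumed day count for "more than 6 months"

reports$date <- as.Date(NA)
for (i in seq_len(nrow(reports))) {
  # Days between each financial date and this report's date
  gap  <- as.numeric(fin$date - reports$date_f[i])
  # Candidate dates: same ID and more than min_gap days after the report
  cand <- fin$date[fin$ID == reports$ID[i] & gap > min_gap]
  # Keep the earliest candidate, if any; otherwise the date stays NA
  if (length(cand) > 0) reports$date[i] <- min(cand)
}

reports
```

Here 2009-12-31 is skipped because it is only 9 days after date_f, so the report is linked to 2010-12-31. A loop like this does not scale to large data; it is meant only to pin down the intended logic.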
Essentially, I want to use the financial report data df1 to forecast data in df2, but the information in df1 is not useful for forecasting 1 week into the future, so I would prefer to forecast the following year's data.
The data looks like the following:
df1:
ID date_f
1 1047699 2014-03-03
2 858339 2007-03-01
3 1002910 2009-12-22
4 277135 2011-02-18
5 753308 2004-03-09
6 1018840 2008-02-26
7 1510295 2011-10-21
8 3133 2014-02-27
9 1467858 2010-02-26
10 865436 2004-11-05
df2:
ID date year
1 3133 1999-12-31 1999
2 3133 2000-12-31 2000
3 3133 2001-12-31 2001
4 3133 2002-12-31 2002
5 3133 2003-12-31 2003
6 3133 2004-12-31 2004
Expected output using the first 5 rows of df1:
ID date_f date year
1 1047699 2014-03-03
2 858339 2007-03-01 2007-12-31 2007
3 1002910 2009-12-22 2010-12-31 2010 *
4 277135 2011-02-18 2011-12-31 2011
5 753308 2004-03-09 2004-12-31 2004
* Here date would be 2009-12-31, which is in df2 and is still > date_f (by about 1 week). However, the condition I would like to impose is that it must be > date_f and date must be more than 6 months (or 180 days) into the future. This observation fails the second condition (since the difference is just 1 week), so I would like to "forecast" the next year's date, which is 2010-12-31.
data1
df1 <- structure(list(ID = c(1047699L, 858339L, 1002910L, 277135L, 753308L,
1018840L, 1510295L, 3133L, 1467858L, 865436L), date_f = structure(c(16132,
13573, 14600, 15023, 12486, 13935, 15268, 16128, 14666, 12727
), class = "Date")), row.names = c(NA, -10L), class = "data.frame")
data2
df2 <- structure(list(ID = c(3133L, 3133L, 3133L, 3133L, 3133L, 3133L,
3133L, 753308L, 753308L, 753308L, 753308L, 753308L, 753308L,
753308L, 753308L, 753308L, 753308L, 753308L, 753308L, 753308L,
753308L, 753308L, 753308L, 753308L, 753308L, 753308L, 753308L,
1467858L, 1467858L, 1467858L, 1467858L, 1467858L, 1467858L, 1467858L,
1467858L, 1467858L, 1467858L, 1467858L, 1467858L, 1467858L, 1467858L,
1467858L, 1467858L, 1467858L, 1467858L, 1467858L, 1467858L, 277135L,
277135L, 277135L, 277135L, 277135L, 277135L, 277135L, 277135L,
277135L, 277135L, 277135L, 277135L, 277135L, 277135L, 277135L,
277135L, 277135L, 277135L, 277135L, 277135L, 1002910L, 1002910L,
1002910L, 1002910L, 1002910L, 1002910L, 1002910L, 1002910L, 1002910L,
1002910L, 1002910L, 1002910L, 1002910L, 1002910L, 1002910L, 1002910L,
1002910L, 1002910L, 1002910L, 1002910L, 858339L, 858339L, 858339L,
858339L, 858339L, 858339L, 858339L, 858339L, 858339L, 858339L,
858339L, 858339L, 858339L, 858339L, 858339L, 858339L, 858339L,
858339L, 858339L, 858339L, 865436L, 865436L, 865436L, 865436L,
865436L, 865436L, 865436L, 865436L, 865436L, 865436L, 865436L,
865436L, 865436L, 865436L, 865436L, 865436L, 865436L, 865436L,
1018840L, 1018840L, 1018840L, 1018840L, 1018840L, 1018840L, 1018840L,
1018840L, 1018840L, 1018840L, 1018840L, 1018840L, 1018840L, 1018840L,
1018840L, 1018840L, 1018840L, 1018840L, 1018840L, 1018840L, 1018840L,
1047699L, 1047699L, 1047699L, 1047699L, 1047699L, 1047699L, 1047699L,
1047699L, 1047699L, 1047699L, 1047699L, 1510295L, 1510295L, 1510295L,
1510295L, 1510295L, 1510295L, 1510295L, 1510295L, 1510295L, 1510295L
), date = structure(c(10956, 11322, 11687, 12052, 12417, 12783,
13148, 10956, 11322, 11687, 12052, 12417, 12783, 13148, 13513,
13878, 14244, 14609, 14974, 15339, 15705, 16070, 16435, 16800,
17166, 17531, 17896, 10956, 11322, 11687, 12052, 12417, 12783,
13148, 13513, 13878, 14244, 14609, 14974, 15339, 15705, 16070,
16435, 16800, 17166, 17531, 17896, 10956, 11322, 11687, 12052,
12417, 12783, 13148, 13513, 13878, 14244, 14609, 14974, 15339,
15705, 16070, 16435, 16800, 17166, 17531, 17896, 10956, 11322,
11687, 12052, 12417, 12783, 13148, 13513, 13878, 17166, 14244,
14609, 14974, 15339, 15705, 16070, 16435, 16800, 17531, 17896,
10956, 11322, 11687, 12052, 12417, 12783, 13148, 13513, 13878,
14244, 14609, 14974, 15339, 15705, 16070, 16435, 16800, 17166,
17531, 17896, 10864, 11230, 11595, 11960, 12325, 12691, 13056,
13421, 13786, 14152, 14517, 14882, 15247, 15613, 15978, 16343,
16708, 17074, 10622, 10987, 11353, 11718, 12083, 12448, 12814,
13179, 13544, 13909, 14275, 14640, 15005, 15370, 15736, 16101,
16466, 16831, 17197, 17562, 17927, 10956, 11322, 11687, 12052,
12417, 12783, 13148, 13513, 13878, 14244, 14609, 14609, 14974,
15339, 15705, 16070, 16435, 16800, 17166, 17531, 17896), class = "Date"),
year = c(1999L, 2000L, 2001L, 2002L, 2003L, 2004L, 2005L,
1999L, 2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 2006L, 2007L,
2008L, 2009L, 2010L, 2011L, 2012L, 2013L, 2014L, 2015L, 2016L,
2017L, 2018L, 1999L, 2000L, 2001L, 2002L, 2003L, 2004L, 2005L,
2006L, 2007L, 2008L, 2009L, 2010L, 2011L, 2012L, 2013L, 2014L,
2015L, 2016L, 2017L, 2018L, 1999L, 2000L, 2001L, 2002L, 2003L,
2004L, 2005L, 2006L, 2007L, 2008L, 2009L, 2010L, 2011L, 2012L,
2013L, 2014L, 2015L, 2016L, 2017L, 2018L, 1999L, 2000L, 2001L,
2002L, 2003L, 2004L, 2005L, 2006L, 2007L, 2016L, 2008L, 2009L,
2010L, 2011L, 2012L, 2013L, 2014L, 2015L, 2017L, 2018L, 1999L,
2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 2006L, 2007L, 2008L,
2009L, 2010L, 2011L, 2012L, 2013L, 2014L, 2015L, 2016L, 2017L,
2018L, 1999L, 2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 2006L,
2007L, 2008L, 2009L, 2010L, 2011L, 2012L, 2013L, 2014L, 2015L,
2016L, 1998L, 1999L, 2000L, 2001L, 2002L, 2003L, 2004L, 2005L,
2006L, 2007L, 2008L, 2009L, 2010L, 2011L, 2012L, 2013L, 2014L,
2015L, 2016L, 2017L, 2018L, 1999L, 2000L, 2001L, 2002L, 2003L,
2004L, 2005L, 2006L, 2007L, 2008L, 2009L, 2009L, 2010L, 2011L,
2012L, 2013L, 2014L, 2015L, 2016L, 2017L, 2018L)), row.names = c(NA,
-167L), class = "data.frame")
#
#
I think this solved my issue:
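library(fuzzyjoin)

# Window in which a df2 date is accepted: more than ~6 months (183 days)
# and at most ~1.5 years (540 days) after the report date
df1$start_date <- df1$date_f + 183
df1$end_date <- df1$date_f + 540

yy <- fuzzy_left_join(
  df1, df2,
  by = c(
    "ID" = "ID",
    "start_date" = "date",
    "end_date" = "date"
  ),
  # ID == ID, start_date < date, end_date >= date
  match_fun = list(`==`, `<`, `>=`)
)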
If anybody sees that I might have failed in my logic, please correct me!
If a financial report is released in March and the financial information is released in July, I want to ignore this join, hence start_date <- df1$date_f + 183. I also set the upper bound to 1.5 years (540 days) after the release of the financial report, so that the following year's report aligns with the correct financial information.
A sample of the output:
ID.x date_f start_date end_date ID.y date fyear
1 1006835 2008-09-30 2009-04-01 2010-03-24 NA <NA> NA
2 1510295 2009-10-19 2010-04-20 2011-04-12 1510295 2010-12-31 2010
3 1506307 2016-02-08 2016-08-09 2017-08-01 1506307 2016-12-31 2016
4 814453 2005-03-15 2005-09-14 2006-09-06 814453 2005-12-31 2005
5 832988 2003-06-19 2003-12-19 2004-12-10 832988 2004-01-31 2003
6 1275283 2007-02-26 2007-08-28 2008-08-19 1275283 2007-12-31 2007
7 858470 2004-03-15 2004-09-14 2005-09-06 858470 2004-12-31 2004
8 885639 2005-03-14 2005-09-13 2006-09-05 885639 2006-01-31 2005
9 732718 2014-04-02 2014-10-02 2015-09-24 NA <NA> NA
10 1385157 2009-03-02 2009-09-01 2010-08-24 1385157 2009-09-30 2009
I.e. ID.x = 1510295 has date_f = 2009-10-19, and joining by year would probably give me the financial information at 2009-12-31, which is only 2 months after the report (which isn't very useful to me).
I create the bounds start_date = 2010-04-20 and end_date = 2011-04-12, so that start_date is now later than the 2009 end-of-year financial information (2009-12-31).
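The window arithmetic for this ID can be checked by hand in base R. The snippet below is a small sketch: the candidates vector is a made-up stand-in for that ID's year-end dates in df2, while the offsets 183 and 540 are the ones used above.

```r
date_f     <- as.Date("2009-10-19")
start_date <- date_f + 183   # lower bound, exclusive
end_date   <- date_f + 540   # upper bound, inclusive

# Hypothetical year-end dates available for this ID
candidates <- as.Date(c("2009-12-31", "2010-12-31", "2011-12-31"))

# Same comparison as the join: start_date < date and end_date >= date
in_window <- candidates > start_date & candidates <= end_date

start_date             # 2010-04-20
end_date               # 2011-04-12
candidates[in_window]  # 2010-12-31
```

Only 2010-12-31 falls inside the window, which agrees with row 2 of the sample output.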
Using fuzzyjoin to match date from df2 to the bounds in df1, I think I am able to join them.
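One way to sanity-check the logic is to compute the gap in days between date_f and the joined date and confirm that every matched row lies in the intended window. The sketch below uses a small hand-built data frame (joined, a hypothetical stand-in for yy) with three rows copied from the sample output.

```r
# Stand-in for the joined result; values taken from the sample output
joined <- data.frame(
  date_f = as.Date(c("2008-09-30", "2009-10-19", "2003-06-19")),
  date   = as.Date(c(NA, "2010-12-31", "2004-01-31"))
)

# Days between the report and the financial data it was linked to
joined$gap_days <- as.numeric(joined$date - joined$date_f)

matched     <- !is.na(joined$gap_days)
ok          <- all(joined$gap_days[matched] > 183 &
                   joined$gap_days[matched] <= 540)
n_unmatched <- sum(!matched)

joined
ok           # TRUE if every matched row respects the window
n_unmatched  # reports with no financial data inside the window
```

Counting the unmatched rows is also worthwhile, since a left join silently keeps reports for which no df2 date falls inside the window.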
###############################################################################
If somebody has a data.table version, that would be great, as this method has memory issues.
setDT(df2)[df1, on = .(ID, date > start_date, date <= end_date)]
Doesn't work as expected...
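One possible reason for the surprise, offered as a guess rather than a diagnosis: in a data.table non-equi join, the join columns in the result carry the values from the i table (here start_date and end_date), not the original date from df2. A minimal sketch that asks for the original columns explicitly through the x. and i. prefixes is shown below; reports and fin are toy tables with made-up IDs standing in for df1 and df2.

```r
library(data.table)

# Toy stand-ins for df1 and df2 (hypothetical IDs and dates)
reports <- data.table(
  ID     = c(1L, 1L, 2L),
  date_f = as.Date(c("2009-12-22", "2011-02-18", "2014-03-03"))
)
fin <- data.table(
  ID   = c(1L, 1L, 1L),
  date = as.Date(c("2009-12-31", "2010-12-31", "2011-12-31")),
  year = c(2009L, 2010L, 2011L)
)

# Same window as the fuzzyjoin version
reports[, `:=`(start_date = date_f + 183, end_date = date_f + 540)]

# Non-equi join; x.* refers to columns of fin, i.* to columns of reports
res <- fin[reports,
           .(ID = i.ID, date_f = i.date_f, date = x.date, year = x.year),
           on = .(ID, date > start_date, date <= end_date)]

res
```

Rows of reports without a match are kept with NA in date and year, which mirrors the left join. With the real data, the same call would use setDT(df1) and setDT(df2) in place of the toy tables.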