(This is a follow-on question to the one originally posted here.)
My original problem has been solved; however, I would also like to merge these two data sets in a slightly different way.
df1 is some sort of "financial report" data and df2 is some sort of "end of year financial data". Previously I wanted to link the financial report data up to the last available financial data.
Now I would like to "forecast" the financial data (df2) using the data in the financial report (df1). That is, link the data by ID and by date_f and date.
I would like to impose the following condition:
Join if date (t+1) from df2 > date_f (t) from df1 and the difference is more than 6 months;
otherwise, take date at t+2. (That is, if the date (t+1) is less than 6 months after date_f (t), then use the date at (t+2).)
Essentially I want to use the financial report data df1 to forecast data in df2, but the information in df1 is not useful for forecasting 1 week into the future, so I would prefer to forecast the following year's data.
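One way to read the rule above is "for each report date, take the earliest financial date that is more than 6 months later". Here is a minimal base R sketch of that reading, assuming 6 months means 180 days; the helper name first_date_after and the vector fy_ends are made up for illustration.

```r
# Hypothetical helper: for one report date, return the earliest candidate
# date lying more than `min_gap` days after it (NA if none qualifies).
first_date_after <- function(date_f, dates, min_gap = 180) {
  ok <- dates[dates > date_f + min_gap]
  if (length(ok) == 0) return(as.Date(NA))
  min(ok)
}

# Toy year-end dates in the shape of df2$date
fy_ends <- as.Date(c("2008-12-31", "2009-12-31", "2010-12-31"))

# 2009-12-31 is only 9 days after the report, so the next year end is used
late_report <- first_date_after(as.Date("2009-12-22"), fy_ends)

# 2009-12-31 is 305 days after the report, so it qualifies directly
early_report <- first_date_after(as.Date("2009-03-01"), fy_ends)

# No year end is far enough in the future, so the result is NA
no_match <- first_date_after(as.Date("2010-09-01"), fy_ends)
```

Applying this per ID (for example with split or a grouped operation) would give the t+1 or t+2 date, without an upper bound on how far ahead the match may be.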
The data looks like the following:
df1:
        ID     date_f
1  1047699 2014-03-03
2   858339 2007-03-01
3  1002910 2009-12-22
4   277135 2011-02-18
5   753308 2004-03-09
6  1018840 2008-02-26
7  1510295 2011-10-21
8     3133 2014-02-27
9  1467858 2010-02-26
10  865436 2004-11-05
df2:
    ID       date year
1 3133 1999-12-31 1999
2 3133 2000-12-31 2000
3 3133 2001-12-31 2001
4 3133 2002-12-31 2002
5 3133 2003-12-31 2003
6 3133 2004-12-31 2004
Expected output using the first 5 rows of df1:
       ID     date_f       date year
1 1047699 2014-03-03
2  858339 2007-03-01 2007-12-31 2007
3 1002910 2009-12-22 2010-12-31 2010 *
4  277135 2011-02-18 2011-12-31 2011
5  753308 2004-03-09 2004-12-31 2004
* Here date would otherwise be 2009-12-31, which is in df2 and is still > date_f (by about 1 week). However, the condition I would like to impose is that it must be > date_f and the date must be more than 6 months (or 180 days) into the future. This observation fails the second condition (since there is just a 1 week difference), so I would like to "forecast" the next year's date, which is 2010-12-31.
data1
df1 <- structure(list(ID = c(1047699L, 858339L, 1002910L, 277135L, 753308L,
1018840L, 1510295L, 3133L, 1467858L, 865436L), date_f = structure(c(16132,
13573, 14600, 15023, 12486, 13935, 15268, 16128, 14666, 12727
), class = "Date")), row.names = c(NA, -10L), class = "data.frame")
data2
df2 <- structure(list(ID = c(3133L, 3133L, 3133L, 3133L, 3133L, 3133L,
3133L, 753308L, 753308L, 753308L, 753308L, 753308L, 753308L,
753308L, 753308L, 753308L, 753308L, 753308L, 753308L, 753308L,
753308L, 753308L, 753308L, 753308L, 753308L, 753308L, 753308L,
1467858L, 1467858L, 1467858L, 1467858L, 1467858L, 1467858L, 1467858L,
1467858L, 1467858L, 1467858L, 1467858L, 1467858L, 1467858L, 1467858L,
1467858L, 1467858L, 1467858L, 1467858L, 1467858L, 1467858L, 277135L,
277135L, 277135L, 277135L, 277135L, 277135L, 277135L, 277135L,
277135L, 277135L, 277135L, 277135L, 277135L, 277135L, 277135L,
277135L, 277135L, 277135L, 277135L, 277135L, 1002910L, 1002910L,
1002910L, 1002910L, 1002910L, 1002910L, 1002910L, 1002910L, 1002910L,
1002910L, 1002910L, 1002910L, 1002910L, 1002910L, 1002910L, 1002910L,
1002910L, 1002910L, 1002910L, 1002910L, 858339L, 858339L, 858339L,
858339L, 858339L, 858339L, 858339L, 858339L, 858339L, 858339L,
858339L, 858339L, 858339L, 858339L, 858339L, 858339L, 858339L,
858339L, 858339L, 858339L, 865436L, 865436L, 865436L, 865436L,
865436L, 865436L, 865436L, 865436L, 865436L, 865436L, 865436L,
865436L, 865436L, 865436L, 865436L, 865436L, 865436L, 865436L,
1018840L, 1018840L, 1018840L, 1018840L, 1018840L, 1018840L, 1018840L,
1018840L, 1018840L, 1018840L, 1018840L, 1018840L, 1018840L, 1018840L,
1018840L, 1018840L, 1018840L, 1018840L, 1018840L, 1018840L, 1018840L,
1047699L, 1047699L, 1047699L, 1047699L, 1047699L, 1047699L, 1047699L,
1047699L, 1047699L, 1047699L, 1047699L, 1510295L, 1510295L, 1510295L,
1510295L, 1510295L, 1510295L, 1510295L, 1510295L, 1510295L, 1510295L
), date = structure(c(10956, 11322, 11687, 12052, 12417, 12783,
13148, 10956, 11322, 11687, 12052, 12417, 12783, 13148, 13513,
13878, 14244, 14609, 14974, 15339, 15705, 16070, 16435, 16800,
17166, 17531, 17896, 10956, 11322, 11687, 12052, 12417, 12783,
13148, 13513, 13878, 14244, 14609, 14974, 15339, 15705, 16070,
16435, 16800, 17166, 17531, 17896, 10956, 11322, 11687, 12052,
12417, 12783, 13148, 13513, 13878, 14244, 14609, 14974, 15339,
15705, 16070, 16435, 16800, 17166, 17531, 17896, 10956, 11322,
11687, 12052, 12417, 12783, 13148, 13513, 13878, 17166, 14244,
14609, 14974, 15339, 15705, 16070, 16435, 16800, 17531, 17896,
10956, 11322, 11687, 12052, 12417, 12783, 13148, 13513, 13878,
14244, 14609, 14974, 15339, 15705, 16070, 16435, 16800, 17166,
17531, 17896, 10864, 11230, 11595, 11960, 12325, 12691, 13056,
13421, 13786, 14152, 14517, 14882, 15247, 15613, 15978, 16343,
16708, 17074, 10622, 10987, 11353, 11718, 12083, 12448, 12814,
13179, 13544, 13909, 14275, 14640, 15005, 15370, 15736, 16101,
16466, 16831, 17197, 17562, 17927, 10956, 11322, 11687, 12052,
12417, 12783, 13148, 13513, 13878, 14244, 14609, 14609, 14974,
15339, 15705, 16070, 16435, 16800, 17166, 17531, 17896), class = "Date"),
year = c(1999L, 2000L, 2001L, 2002L, 2003L, 2004L, 2005L,
1999L, 2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 2006L, 2007L,
2008L, 2009L, 2010L, 2011L, 2012L, 2013L, 2014L, 2015L, 2016L,
2017L, 2018L, 1999L, 2000L, 2001L, 2002L, 2003L, 2004L, 2005L,
2006L, 2007L, 2008L, 2009L, 2010L, 2011L, 2012L, 2013L, 2014L,
2015L, 2016L, 2017L, 2018L, 1999L, 2000L, 2001L, 2002L, 2003L,
2004L, 2005L, 2006L, 2007L, 2008L, 2009L, 2010L, 2011L, 2012L,
2013L, 2014L, 2015L, 2016L, 2017L, 2018L, 1999L, 2000L, 2001L,
2002L, 2003L, 2004L, 2005L, 2006L, 2007L, 2016L, 2008L, 2009L,
2010L, 2011L, 2012L, 2013L, 2014L, 2015L, 2017L, 2018L, 1999L,
2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 2006L, 2007L, 2008L,
2009L, 2010L, 2011L, 2012L, 2013L, 2014L, 2015L, 2016L, 2017L,
2018L, 1999L, 2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 2006L,
2007L, 2008L, 2009L, 2010L, 2011L, 2012L, 2013L, 2014L, 2015L,
2016L, 1998L, 1999L, 2000L, 2001L, 2002L, 2003L, 2004L, 2005L,
2006L, 2007L, 2008L, 2009L, 2010L, 2011L, 2012L, 2013L, 2014L,
2015L, 2016L, 2017L, 2018L, 1999L, 2000L, 2001L, 2002L, 2003L,
2004L, 2005L, 2006L, 2007L, 2008L, 2009L, 2009L, 2010L, 2011L,
2012L, 2013L, 2014L, 2015L, 2016L, 2017L, 2018L)), row.names = c(NA,
-167L), class = "data.frame")
#
I think this solved my issue:
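library(fuzzyjoin)

# Matching window: more than 183 days and at most 540 days after the report date
df1$start_date <- df1$date_f + 183
df1$end_date <- df1$date_f + 540

yy <- fuzzy_left_join(
  df1, df2,
  by = c(
    "ID" = "ID",
    "start_date" = "date",
    "end_date" = "date"
  ),
  # ID == ID, start_date < date, end_date >= date
  match_fun = list(`==`, `<`, `>=`)
)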
If anybody sees that I might have failed in my logic, please correct me!
If a financial report is released in March and the financial information is released in July, I want to ignore this join; hence start_date <- df1$date_f + 183. I also set the upper bound to be 1.5 years (540 days) from the release of the financial report. Therefore the following year's report will align with the correct financial information.
A sample of the output:
      ID.x     date_f start_date   end_date    ID.y       date fyear
1  1006835 2008-09-30 2009-04-01 2010-03-24      NA       <NA>    NA
2  1510295 2009-10-19 2010-04-20 2011-04-12 1510295 2010-12-31  2010
3  1506307 2016-02-08 2016-08-09 2017-08-01 1506307 2016-12-31  2016
4   814453 2005-03-15 2005-09-14 2006-09-06  814453 2005-12-31  2005
5   832988 2003-06-19 2003-12-19 2004-12-10  832988 2004-01-31  2003
6  1275283 2007-02-26 2007-08-28 2008-08-19 1275283 2007-12-31  2007
7   858470 2004-03-15 2004-09-14 2005-09-06  858470 2004-12-31  2004
8   885639 2005-03-14 2005-09-13 2006-09-05  885639 2006-01-31  2005
9   732718 2014-04-02 2014-10-02 2015-09-24      NA       <NA>    NA
10 1385157 2009-03-02 2009-09-01 2010-08-24 1385157 2009-09-30  2009
I.e. ID.x = 1510295 has date_f = 2009-10-19, and joining by year would probably give me the financial information at 2009-12-31, which is only 2 months after the report (which isn't very useful to me).
I create the bounds start_date = 2010-04-20 and end_date = 2011-04-12, where start_date is now later than the 2009 end-of-year financial information date, 2009-12-31.
Using fuzzyjoin to match the date from df2 to the bounds in df1, I (think) I am able to join them.
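The bounds for this example can be checked with plain Date arithmetic. This is a small sketch using only the values quoted above; the vector name fy_dates is made up for illustration.

```r
date_f     <- as.Date("2009-10-19")
start_date <- date_f + 183   # 2010-04-20
end_date   <- date_f + 540   # 2011-04-12

# Candidate year-end dates in the shape of df2$date
fy_dates <- as.Date(c("2009-12-31", "2010-12-31", "2011-12-31"))

# Same comparison as the fuzzy join: start_date < date and end_date >= date
matched <- fy_dates[fy_dates > start_date & fy_dates <= end_date]
matched  # 2010-12-31
```

Only 2010-12-31 falls inside the window, so the 2009 year end (about 2 months after the report) is skipped.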
###############################################################################
If somebody has a data.table version, that would be great, as this method has memory issues.
setDT(df2)[df1, on = .(ID, date > start_date, date <= end_date)]
doesn't work as expected...
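A possible data.table sketch follows, assuming the same start_date and end_date window as in the fuzzyjoin version. In a non-equi join, data.table fills the joined columns of the result with the values from the i table (here the bounds from df1), so the date column no longer shows df2's own dates; this may be why the one-liner looks wrong. Asking for x.date explicitly in j is one way around it. The toy rows and the choice of output columns below are assumptions for illustration.

```r
library(data.table)

# Toy data in the shape of df1 / df2 (a few rows only)
df1 <- data.table(
  ID     = c(1002910L, 858339L, 1047699L),
  date_f = as.Date(c("2009-12-22", "2007-03-01", "2014-03-03"))
)
df2 <- data.table(
  ID   = rep(c(1002910L, 858339L), each = 3),
  date = as.Date(c("2008-12-31", "2009-12-31", "2010-12-31",
                   "2006-12-31", "2007-12-31", "2008-12-31")),
  year = c(2008L, 2009L, 2010L, 2006L, 2007L, 2008L)
)

# Matching window, as in the fuzzyjoin version
df1[, `:=`(start_date = date_f + 183, end_date = date_f + 540)]

# Non-equi join: keep every df1 row, and pull df2's own columns via the x. prefix
res <- df2[df1,
           on = .(ID, date > start_date, date <= end_date),
           .(ID = i.ID, date_f = i.date_f, date = x.date, year = x.year)]
res
```

Rows of df1 without a df2 date inside the window are kept with NA in date and year, which mirrors the left join. Whether this actually uses less memory than fuzzyjoin on the full data is untested here.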