
Joining data by date based on a time period condition

(This is a follow-up question to the one originally posted here.)

My original problem has been solved; however, I would also like to merge these two data sets in a slightly different way.

df1 is some sort of "financial report" data and df2 is some sort of "end of year financial data". Previously I wanted to link the financial report data up to the last available financial data.

Now I would like to "forecast" the financial data (df2) using the data in the financial report (df1). That is, link the data by ID, date_f and date.

I would like to impose the following condition:

Join if:

date (t+1) from df2 > date_f (t) from df1, and the difference is more than 6 months;

otherwise:

take date at t+2. (That is, if the date (t+1) is less than 6 months after date_f (t), then use the date at (t+2).)

Essentially, I want to use the financial report data df1 to forecast data in df2, but the information in df1 is not useful for forecasting 1 week into the future; therefore, I would prefer to forecast the following year's data.
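To make the rule concrete, here is a minimal base R sketch. The helper name `next_usable_date`, the toy data frame `df2_toy`, and the 183-day default for "6 months" are assumptions made for illustration; they are not part of the original data or code.

```r
# Hypothetical helper: for one (ID, date_f) pair, return the earliest df2 date
# for that ID that lies more than `min_gap` days after date_f, or NA if none.
next_usable_date <- function(id, date_f, df2, min_gap = 183) {
  cand <- df2$date[df2$ID == id & df2$date > date_f + min_gap]
  if (length(cand) == 0) as.Date(NA) else min(cand)
}

# Toy year-end dates for a single ID (made up to mirror the shape of df2)
df2_toy <- data.frame(
  ID   = rep(1002910L, 3),
  date = as.Date(c("2008-12-31", "2009-12-31", "2010-12-31"))
)

# The report dated 2009-12-22 skips 2009-12-31 (too close) and takes 2010-12-31
next_usable_date(1002910L, as.Date("2009-12-22"), df2_toy)
```

Applied row by row this would be slow on large data, so it is meant only as a readable statement of the condition, not as a join strategy.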

The data looks like the following:

df1:

        ID     date_f
1  1047699 2014-03-03
2   858339 2007-03-01
3  1002910 2009-12-22
4   277135 2011-02-18
5   753308 2004-03-09
6  1018840 2008-02-26
7  1510295 2011-10-21
8     3133 2014-02-27
9  1467858 2010-02-26
10  865436 2004-11-05

df2:

    ID       date year
1 3133 1999-12-31 1999
2 3133 2000-12-31 2000
3 3133 2001-12-31 2001
4 3133 2002-12-31 2002
5 3133 2003-12-31 2003
6 3133 2004-12-31 2004

Expected output using the first 5 rows of df1:

        ID     date_f       date year
1  1047699 2014-03-03
2   858339 2007-03-01 2007-12-31 2007
3  1002910 2009-12-22 2010-12-31 2010 *
4   277135 2011-02-18 2011-12-31 2011
5   753308 2004-03-09 2004-12-31 2004

* Here a plain join would give the date 2009-12-31, which is in df2 and is still > date_f (by about 1 week). However, the condition I would like to impose is that the date must be > date_f and more than 6 months (or 180 days) into the future. This observation fails the second condition (since the difference is just 1 week), so I would like to "forecast" the next year's date, which is 2010-12-31.

data1

df1 <- structure(list(ID = c(1047699L, 858339L, 1002910L, 277135L, 753308L, 
1018840L, 1510295L, 3133L, 1467858L, 865436L), date_f = structure(c(16132, 
13573, 14600, 15023, 12486, 13935, 15268, 16128, 14666, 12727
), class = "Date")), row.names = c(NA, -10L), class = "data.frame")

data2

df2 <- structure(list(ID = c(3133L, 3133L, 3133L, 3133L, 3133L, 3133L, 
3133L, 753308L, 753308L, 753308L, 753308L, 753308L, 753308L, 
753308L, 753308L, 753308L, 753308L, 753308L, 753308L, 753308L, 
753308L, 753308L, 753308L, 753308L, 753308L, 753308L, 753308L, 
1467858L, 1467858L, 1467858L, 1467858L, 1467858L, 1467858L, 1467858L, 
1467858L, 1467858L, 1467858L, 1467858L, 1467858L, 1467858L, 1467858L, 
1467858L, 1467858L, 1467858L, 1467858L, 1467858L, 1467858L, 277135L, 
277135L, 277135L, 277135L, 277135L, 277135L, 277135L, 277135L, 
277135L, 277135L, 277135L, 277135L, 277135L, 277135L, 277135L, 
277135L, 277135L, 277135L, 277135L, 277135L, 1002910L, 1002910L, 
1002910L, 1002910L, 1002910L, 1002910L, 1002910L, 1002910L, 1002910L, 
1002910L, 1002910L, 1002910L, 1002910L, 1002910L, 1002910L, 1002910L, 
1002910L, 1002910L, 1002910L, 1002910L, 858339L, 858339L, 858339L, 
858339L, 858339L, 858339L, 858339L, 858339L, 858339L, 858339L, 
858339L, 858339L, 858339L, 858339L, 858339L, 858339L, 858339L, 
858339L, 858339L, 858339L, 865436L, 865436L, 865436L, 865436L, 
865436L, 865436L, 865436L, 865436L, 865436L, 865436L, 865436L, 
865436L, 865436L, 865436L, 865436L, 865436L, 865436L, 865436L, 
1018840L, 1018840L, 1018840L, 1018840L, 1018840L, 1018840L, 1018840L, 
1018840L, 1018840L, 1018840L, 1018840L, 1018840L, 1018840L, 1018840L, 
1018840L, 1018840L, 1018840L, 1018840L, 1018840L, 1018840L, 1018840L, 
1047699L, 1047699L, 1047699L, 1047699L, 1047699L, 1047699L, 1047699L, 
1047699L, 1047699L, 1047699L, 1047699L, 1510295L, 1510295L, 1510295L, 
1510295L, 1510295L, 1510295L, 1510295L, 1510295L, 1510295L, 1510295L
), date = structure(c(10956, 11322, 11687, 12052, 12417, 12783, 
13148, 10956, 11322, 11687, 12052, 12417, 12783, 13148, 13513, 
13878, 14244, 14609, 14974, 15339, 15705, 16070, 16435, 16800, 
17166, 17531, 17896, 10956, 11322, 11687, 12052, 12417, 12783, 
13148, 13513, 13878, 14244, 14609, 14974, 15339, 15705, 16070, 
16435, 16800, 17166, 17531, 17896, 10956, 11322, 11687, 12052, 
12417, 12783, 13148, 13513, 13878, 14244, 14609, 14974, 15339, 
15705, 16070, 16435, 16800, 17166, 17531, 17896, 10956, 11322, 
11687, 12052, 12417, 12783, 13148, 13513, 13878, 17166, 14244, 
14609, 14974, 15339, 15705, 16070, 16435, 16800, 17531, 17896, 
10956, 11322, 11687, 12052, 12417, 12783, 13148, 13513, 13878, 
14244, 14609, 14974, 15339, 15705, 16070, 16435, 16800, 17166, 
17531, 17896, 10864, 11230, 11595, 11960, 12325, 12691, 13056, 
13421, 13786, 14152, 14517, 14882, 15247, 15613, 15978, 16343, 
16708, 17074, 10622, 10987, 11353, 11718, 12083, 12448, 12814, 
13179, 13544, 13909, 14275, 14640, 15005, 15370, 15736, 16101, 
16466, 16831, 17197, 17562, 17927, 10956, 11322, 11687, 12052, 
12417, 12783, 13148, 13513, 13878, 14244, 14609, 14609, 14974, 
15339, 15705, 16070, 16435, 16800, 17166, 17531, 17896), class = "Date"), 
    year = c(1999L, 2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 
    1999L, 2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 2006L, 2007L, 
    2008L, 2009L, 2010L, 2011L, 2012L, 2013L, 2014L, 2015L, 2016L, 
    2017L, 2018L, 1999L, 2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 
    2006L, 2007L, 2008L, 2009L, 2010L, 2011L, 2012L, 2013L, 2014L, 
    2015L, 2016L, 2017L, 2018L, 1999L, 2000L, 2001L, 2002L, 2003L, 
    2004L, 2005L, 2006L, 2007L, 2008L, 2009L, 2010L, 2011L, 2012L, 
    2013L, 2014L, 2015L, 2016L, 2017L, 2018L, 1999L, 2000L, 2001L, 
    2002L, 2003L, 2004L, 2005L, 2006L, 2007L, 2016L, 2008L, 2009L, 
    2010L, 2011L, 2012L, 2013L, 2014L, 2015L, 2017L, 2018L, 1999L, 
    2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 2006L, 2007L, 2008L, 
    2009L, 2010L, 2011L, 2012L, 2013L, 2014L, 2015L, 2016L, 2017L, 
    2018L, 1999L, 2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 2006L, 
    2007L, 2008L, 2009L, 2010L, 2011L, 2012L, 2013L, 2014L, 2015L, 
    2016L, 1998L, 1999L, 2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 
    2006L, 2007L, 2008L, 2009L, 2010L, 2011L, 2012L, 2013L, 2014L, 
    2015L, 2016L, 2017L, 2018L, 1999L, 2000L, 2001L, 2002L, 2003L, 
    2004L, 2005L, 2006L, 2007L, 2008L, 2009L, 2009L, 2010L, 2011L, 
    2012L, 2013L, 2014L, 2015L, 2016L, 2017L, 2018L)), row.names = c(NA, 
-167L), class = "data.frame")
#

I think this solved my issue:

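library(fuzzyjoin)

# Matching window: a df2 date must fall more than 183 days (~6 months)
# and at most 540 days (~1.5 years) after the report date.
df1$start_date <- df1$date_f + 183
df1$end_date <- df1$date_f + 540

yy <- fuzzy_left_join(
  df1, df2,
  by = c(
    "ID" = "ID",
    "start_date" = "date",
    "end_date" = "date"
  ),
  # ID.x == ID.y, start_date < date, end_date >= date
  match_fun = list(`==`, `<`, `>=`)
)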

If anybody sees that I might have made a mistake in my logic, please correct me!

If a financial report is released in March and the financial information is released in July, I want to ignore this join. Hence start_date <- df1$date_f + 183. I also set the upper bound to be roughly 1.5 years (540 days) from the release of the financial report. Therefore the following year's report will align with the correct financial information.

A sample of the output:

     ID.x     date_f    start_date end_date   ID.y      date     fyear
1  1006835  2008-09-30 2009-04-01 2010-03-24      NA       <NA>    NA
2  1510295  2009-10-19 2010-04-20 2011-04-12 1510295 2010-12-31  2010
3  1506307  2016-02-08 2016-08-09 2017-08-01 1506307 2016-12-31  2016
4   814453  2005-03-15 2005-09-14 2006-09-06  814453 2005-12-31  2005
5   832988  2003-06-19 2003-12-19 2004-12-10  832988 2004-01-31  2003
6  1275283  2007-02-26 2007-08-28 2008-08-19 1275283 2007-12-31  2007
7   858470  2004-03-15 2004-09-14 2005-09-06  858470 2004-12-31  2004
8   885639  2005-03-14 2005-09-13 2006-09-05  885639 2006-01-31  2005
9   732718  2014-04-02 2014-10-02 2015-09-24      NA       <NA>    NA
10 1385157  2009-03-02 2009-09-01 2010-08-24 1385157 2009-09-30  2009

I.e., ID.x = 1510295 has date_f = 2009-10-19, and joining by year would probably give me the financial information at 2009-12-31, which is only 2 months after the report (which isn't very useful to me).

I create the bounds start_date = 2010-04-20 and end_date = 2011-04-12, so the start_date is now later than the 2009 end-of-year financial information (2009-12-31).
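The bounds for this row can be checked with plain Date arithmetic. In the sketch below, the `candidates` vector is an assumed set of year-end dates for illustration, not data taken from df2.

```r
# Report date for ID 1510295, from the sample output above
date_f     <- as.Date("2009-10-19")
start_date <- date_f + 183   # lower bound (exclusive)
end_date   <- date_f + 540   # upper bound (inclusive)

# Assumed year-end dates to test against the window
candidates <- as.Date(c("2009-12-31", "2010-12-31", "2011-12-31"))

# Same comparison as the fuzzy join: start_date < date and end_date >= date
in_window <- candidates > start_date & candidates <= end_date
candidates[in_window]
```

Only 2010-12-31 falls inside the window: 2009-12-31 comes before start_date and 2011-12-31 after end_date.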

Using fuzzyjoin to match the date from df2 to the bounds in df1, I think I am able to join them.

###############################################################################

If somebody has a data.table version, that would be great, as this method has memory issues.

setDT(df2)[df1, on = .(ID, date > start_date, date <= end_date)]

Doesn't work as expected...
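One possible reason the result looks wrong: in a data.table non-equi join, the join columns of the result carry the values from the i table, so the column named date ends up holding start_date rather than the df2 date. The sketch below, on made-up toy tables `dt1` and `dt2`, uses the `x.` and `i.` prefixes in j to pull the original columns explicitly. It is a sketch under these assumptions, not a tested drop-in replacement for the fuzzyjoin call.

```r
library(data.table)

# Toy data (hypothetical, mirroring the shape of df1 and df2)
dt1 <- data.table(
  ID     = c(1002910L, 858339L, 1047699L),
  date_f = as.Date(c("2009-12-22", "2007-03-01", "2014-03-03"))
)
dt2 <- data.table(
  ID   = rep(c(1002910L, 858339L), each = 3),
  date = as.Date(c("2008-12-31", "2009-12-31", "2010-12-31",
                   "2006-12-31", "2007-12-31", "2008-12-31")),
  year = c(2008L, 2009L, 2010L, 2006L, 2007L, 2008L)
)

# Same window as in the fuzzyjoin version
dt1[, `:=`(start_date = date_f + 183, end_date = date_f + 540)]

# Non-equi join; x.* are dt2's own columns, i.* are dt1's columns.
# Rows of dt1 without a match are kept, with NA in the dt2 columns.
res <- dt2[dt1,
           on = .(ID, date > start_date, date <= end_date),
           .(ID = i.ID, date_f = i.date_f, date = x.date, year = x.year)]
res
```

Building dt1 and dt2 as separate data.tables (or via as.data.table on the real data) avoids modifying the original data frames by reference, which setDT would do.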
