
Comparing 2 dataframes to add rows if between dates

Completely new here. I tried looking up my problem but couldn't find anything quite similar!

I'm trying to set up a dataframe that contains the data for a schedule and its activity types. For example, '1' is a normal activity and '2' is a canceled one. I want to compare that dataframe to a second one and check whether any of its dates fall between the Start/End dates of a row in the first dataframe. If so, that row should become 3 rows instead of 1: the original Start/End range up to the holiday, the holiday itself, and the rest of the Start/End range after the holiday.

I have no problem creating a single dataframe. My problem arises when I want to compare it with another series/dataframe and potentially add rows for dates that fall between StartDate and EndDate.

Example Schedule dataframe:

Start Date  End Date    Activity Type
2021-01-01  2021-12-31  1

When compared to the other dataframe:

Holiday Start Date  Holiday End Date
2021-02-14          2021-02-14
2021-07-04          2021-07-05

Ending up like this:

Start Date  End Date    Activity Type
2021-01-01  2021-02-13  1
2021-02-14  2021-02-14  2
2021-02-15  2021-07-03  1
2021-07-04  2021-07-04  2
2021-07-05  2021-12-31  1

Any help is appreciated!

Thanks, S.

To present a more instructive example, I created Schedule with multiple rows:

  Start Date   End Date  Activity Type
0 2021-01-01 2021-05-31             10
1 2021-06-01 2021-12-31             20

I created Holidays as:

  Holiday Start Date Holiday End Date
0         2021-02-14       2021-02-14
1         2021-03-10       2021-03-12
2         2021-07-04       2021-07-06

All date columns are of datetime64 type.
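As a minimal sketch, the two test frames above could be built as follows. The use of pd.to_datetime to obtain datetime64 columns is an assumption about how the data was prepared; the answer only states the resulting dtype.

```python
import pandas as pd

# Schedule: one row per activity period (values copied from the table above)
Schedule = pd.DataFrame({
    'Start Date': pd.to_datetime(['2021-01-01', '2021-06-01']),
    'End Date': pd.to_datetime(['2021-05-31', '2021-12-31']),
    'Activity Type': [10, 20],
})

# Holidays: one row per holiday period, both ends inclusive
Holidays = pd.DataFrame({
    'Holiday Start Date': pd.to_datetime(['2021-02-14', '2021-03-10', '2021-07-04']),
    'Holiday End Date': pd.to_datetime(['2021-02-14', '2021-03-12', '2021-07-06']),
})

# Check that the date columns really are datetime64
print(Schedule.dtypes)
print(Holidays.dtypes)
```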

A preparatory step is to create an IntervalIndex from Holidays:

import pandas as pd

# closed='both' makes the holiday start and end dates inclusive
ind = pd.IntervalIndex.from_arrays(Holidays['Holiday Start Date'],
    Holidays['Holiday End Date'], closed='both')
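To see what this index is used for, here is a small sketch of a membership test against it. The helper name is_holiday is hypothetical and only serves this illustration; ind.contains returns one boolean per interval, so any() says whether the day falls in at least one holiday.

```python
import pandas as pd

# Holiday periods copied from the Holidays table above
starts = pd.to_datetime(['2021-02-14', '2021-03-10', '2021-07-04'])
ends = pd.to_datetime(['2021-02-14', '2021-03-12', '2021-07-06'])
ind = pd.IntervalIndex.from_arrays(starts, ends, closed='both')

def is_holiday(day):
    """Return True if the given day lies inside any holiday interval."""
    return bool(ind.contains(pd.Timestamp(day)).any())

# One day on a single-day holiday, one on the last day of a period, one just after it
holiday_flags = {d: is_holiday(d) for d in ['2021-02-14', '2021-03-12', '2021-03-13']}
print(holiday_flags)
```

Because the intervals are closed on both sides, the last day of a holiday period still counts as a holiday, while the following day does not.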

To get the result for a single row, create the following function:

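def getActivities(row):
    # One entry per calendar day covered by this Schedule row
    dd = pd.date_range(row['Start Date'], row['End Date'])
    # True for each day that falls inside any holiday interval
    ss = dd.to_series().apply(lambda dat: ind.contains(dat).any())
    # First day of each run of equal flags (start of a period)
    s1 = ss[ss != ss.shift()]
    # Last day of each run of equal flags (end of a period)
    s2 = ss[ss != ss.shift(-1)]
    # Original Activity Type for normal periods, original + 1 for holidays
    s1 = s1.astype(int) + row['Activity Type']
    rv = s1.astype(int).reset_index().rename(columns={'index': 'Start Date',
        0: 'Activity Type'})
    rv.insert(1, 'End Date', s2.index)
    return rv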

To test this function, you can call it on a single row, say, the initial row:

getActivities(Schedule.iloc[0])

To fully understand all the details, save a single row of Schedule in a variable:

row = Schedule.iloc[0]

Then execute each instruction from getActivities and inspect the intermediate results.
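The key intermediate step is the comparison with shift(), which finds where the holiday flag changes. A minimal sketch on a hand-made five-day series (the dates and flags here are illustrative, with a single holiday on 2021-02-14):

```python
import pandas as pd

# Five consecutive days; only the middle one is flagged as a holiday
days = pd.date_range('2021-02-12', '2021-02-16')
ss = pd.Series([False, False, True, False, False], index=days)

# A day starts a run when its flag differs from the previous day's flag
# (the very first day always qualifies, because shift() leaves NaN there)
run_starts = ss[ss != ss.shift()]

# A day ends a run when its flag differs from the next day's flag
# (the very last day always qualifies, for the same reason)
run_ends = ss[ss != ss.shift(-1)]

print(run_starts)
print(run_ends)
```

Pairing the i-th entry of run_starts with the i-th entry of run_ends gives the start and end of each period, which is what getActivities does with s1 and s2.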

To get the expected result for all rows, apply this function to each row and concatenate the results:

pd.concat(Schedule.apply(getActivities, axis=1).values, ignore_index=True)

For my test data, the result is:

  Start Date   End Date  Activity Type
0 2021-01-01 2021-02-13             10
1 2021-02-14 2021-02-14             11
2 2021-02-15 2021-03-09             10
3 2021-03-10 2021-03-12             11
4 2021-03-13 2021-05-31             10
5 2021-06-01 2021-07-03             20
6 2021-07-04 2021-07-06             21
7 2021-07-07 2021-12-31             20
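For reference, the steps above can be assembled into one self-contained script. This is a sketch under two assumptions: the test frames are rebuilt with pd.to_datetime, and a list comprehension over Schedule.iterrows() is used in place of Schedule.apply(..., axis=1). That second choice is a substitution made for this illustration, not the call used in the answer.

```python
import pandas as pd

Schedule = pd.DataFrame({
    'Start Date': pd.to_datetime(['2021-01-01', '2021-06-01']),
    'End Date': pd.to_datetime(['2021-05-31', '2021-12-31']),
    'Activity Type': [10, 20],
})
Holidays = pd.DataFrame({
    'Holiday Start Date': pd.to_datetime(['2021-02-14', '2021-03-10', '2021-07-04']),
    'Holiday End Date': pd.to_datetime(['2021-02-14', '2021-03-12', '2021-07-06']),
})

# Holiday periods as intervals, inclusive on both ends
ind = pd.IntervalIndex.from_arrays(Holidays['Holiday Start Date'],
                                   Holidays['Holiday End Date'], closed='both')

def getActivities(row):
    # Flag every day of the row's date range as holiday / not holiday
    dd = pd.date_range(row['Start Date'], row['End Date'])
    ss = dd.to_series().apply(lambda dat: ind.contains(dat).any())
    s1 = ss[ss != ss.shift()]      # first day of each run
    s2 = ss[ss != ss.shift(-1)]    # last day of each run
    # Normal runs keep the original type, holiday runs get original + 1
    s1 = s1.astype(int) + row['Activity Type']
    rv = s1.astype(int).reset_index().rename(
        columns={'index': 'Start Date', 0: 'Activity Type'})
    rv.insert(1, 'End Date', s2.index)
    return rv

# One small frame per Schedule row, stacked into a single result
result = pd.concat([getActivities(row) for _, row in Schedule.iterrows()],
                   ignore_index=True)
print(result)
```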

The first 5 rows come from row 0 of Schedule, which contains 2 holiday periods. The last 3 rows come from row 1, which contains 1 holiday period.

Note that Activity Type is either the original value (for a "normal" period) or the original value + 1 (for a holiday period), so the Activity Type values in Schedule should not be consecutive integers. Otherwise the holiday code of one activity would coincide with the normal code of another.
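If a fixed code for holidays is preferred over the "+ 1" convention (the question uses '2' for canceled), one possible variant is sketched below. The function name split_row and the holiday_code parameter are hypothetical, introduced only for this example; the data is the single-row schedule and the two holiday periods from the question.

```python
import numpy as np
import pandas as pd

Schedule = pd.DataFrame({
    'Start Date': pd.to_datetime(['2021-01-01']),
    'End Date': pd.to_datetime(['2021-12-31']),
    'Activity Type': [1],
})
Holidays = pd.DataFrame({
    'Holiday Start Date': pd.to_datetime(['2021-02-14', '2021-07-04']),
    'Holiday End Date': pd.to_datetime(['2021-02-14', '2021-07-05']),
})
ind = pd.IntervalIndex.from_arrays(Holidays['Holiday Start Date'],
                                   Holidays['Holiday End Date'], closed='both')

def split_row(row, ind, holiday_code=2):
    """Split one schedule row into normal and holiday periods.

    holiday_code is an assumed fixed Activity Type for holiday periods.
    """
    days = pd.date_range(row['Start Date'], row['End Date'])
    is_holiday = pd.Series([ind.contains(d).any() for d in days],
                           index=days, dtype=bool)
    starts = is_holiday[is_holiday != is_holiday.shift()]    # first day of each run
    ends = is_holiday[is_holiday != is_holiday.shift(-1)]    # last day of each run
    return pd.DataFrame({
        'Start Date': starts.index,
        'End Date': ends.index,
        # Holiday runs get the fixed code, other runs keep the row's own type
        'Activity Type': np.where(starts.to_numpy(), holiday_code,
                                  row['Activity Type']),
    })

result = pd.concat([split_row(row, ind) for _, row in Schedule.iterrows()],
                   ignore_index=True)
print(result)
```

With this variant the Activity Type values in Schedule may be consecutive, as long as none of them equals the chosen holiday code.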
