
How to calculate date difference between rows in pandas

I have a data frame that looks like this.

ID Start End
1 2020-12-13 2020-12-20
1 2020-12-26 2021-01-20
1 2020-02-20 2020-02-21
2 2020-12-13 2020-12-20
2 2021-01-11 2021-01-20
2 2021-02-15 2021-02-26

Using pandas, I am trying to group by ID and then, within each group, subtract the start date of the current row from the end date of the previous row.

If the difference is greater than 5, it should return True.

I'm new to pandas, and I've been trying to figure this out all day.

Two assumptions:

  1. By difference greater than 5, you mean 5 days
  2. You mean the absolute difference
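To make the second assumption concrete, here is a small sketch using the first two rows of ID 1 from the question. The variable names (`prev_end`, `start`, `signed`, `threshold`) are only illustrative and are not part of the original answer.

```python
import pandas as pd

prev_end = pd.Timestamp("2020-12-20")   # end date of the previous row (ID 1, first row)
start = pd.Timestamp("2020-12-26")      # start date of the current row (ID 1, second row)

signed = prev_end - start               # previous end minus current start: -6 days
threshold = pd.Timedelta(days=5)

print(signed > threshold)               # False: the signed difference is negative
print(abs(signed) > threshold)          # True: the gap is 6 days in absolute terms
```

Without the absolute value, a gap that runs in the "normal" direction (start after the previous end) would never exceed the threshold.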

So I am starting with this dataframe (column names in lowercase), to which I added the column 'above_5_days'.

df
   ID      start        end above_5_days
0   1 2020-12-13 2020-12-20         None
1   1 2020-12-26 2021-01-20         None
2   1 2020-02-20 2020-02-21         None
3   2 2020-12-13 2020-12-20         None
4   2 2021-01-11 2021-01-20         None
5   2 2021-02-15 2021-02-26         None
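The answer does not show how this dataframe was built. One possible way to reproduce it is sketched below, assuming the dates start out as strings and are converted with `pd.to_datetime` so that subtracting them yields time differences.

```python
import pandas as pd

# Sample data copied from the question; column names follow the answer ("start", "end").
df = pd.DataFrame(
    {
        "ID": [1, 1, 1, 2, 2, 2],
        "start": ["2020-12-13", "2020-12-26", "2020-02-20",
                  "2020-12-13", "2021-01-11", "2021-02-15"],
        "end": ["2020-12-20", "2021-01-20", "2020-02-21",
                "2020-12-20", "2021-01-20", "2021-02-26"],
    }
)

# Convert the date strings to datetime64 so that subtraction yields Timedelta values.
df["start"] = pd.to_datetime(df["start"])
df["end"] = pd.to_datetime(df["end"])

# Placeholder column; it is filled in by the steps that follow.
df["above_5_days"] = None
```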

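This is the groupby object that will be used to apply the operation to each ID group. Passing group_keys=False keeps the original row index in the result; recent pandas versions would otherwise add ID as an extra index level.

id_grp = df.groupby("ID", group_keys=False)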

The following function is the operation that will be applied to each group:

import pandas as pd

def calc_diff(x):
    x = x.copy()  # work on a copy so the group passed in by groupby is not modified in place

    # shift the end dates down by one row to align each start with the previous row's end
    to_subtract_from = x["end"].shift(periods=1)
    diff = to_subtract_from - x["start"]  # previous end minus current start

    # set the new column to True/False depending on the condition
    # if you don't want the absolute difference, remove .abs()
    x["above_5_days"] = diff.abs() > pd.to_timedelta(5, unit="D")
    return x

Now apply this to every group and store the result in newdf:

newdf = id_grp.apply(calc_diff)
newdf
   ID      start        end  above_5_days
0   1 2020-12-13 2020-12-20         False
1   1 2020-12-26 2021-01-20          True
2   1 2020-02-20 2020-02-21          True
3   2 2020-12-13 2020-12-20         False
4   2 2021-01-11 2021-01-20          True
5   2 2021-02-15 2021-02-26          True

I should point out that:

In this case, the False values in the first row of each group appear only because shifting the end column down within each group leaves a missing value (NaT) in the first row. Subtracting the start date from it gives NaT again, and comparing NaT with the 5-day threshold evaluates to False. So those False values are not real results; they simply stand in for a missing value.
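The behaviour described here can be checked in isolation. This is a minimal sketch on two rows taken from ID 1; the names `prev_end`, `diff` and `flag` are illustrative only.

```python
import pandas as pd

end = pd.Series(pd.to_datetime(["2020-12-20", "2021-01-20"]))
start = pd.Series(pd.to_datetime(["2020-12-13", "2020-12-26"]))

prev_end = end.shift(1)                      # the first element becomes NaT
diff = prev_end - start                      # NaT minus a timestamp is still NaT
flag = diff.abs() > pd.Timedelta(days=5)     # comparing NaT yields False

print(diff.isna().tolist())                  # [True, False]
print(flag.tolist())                         # [False, True]
```

The first entry of `flag` is False even though no previous row exists, so it carries no information about the gap.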

That is why I would personally change the function to:

def calc_diff(x):
    x = x.copy()  # work on a copy so the group passed in by groupby is not modified in place

    # shift the end dates down by one row to align each start with the previous row's end
    to_subtract_from = x["end"].shift(periods=1)
    diff = to_subtract_from - x["start"]  # previous end minus current start

    # set the new column to True/False depending on the condition
    # (pd is pandas, imported with: import pandas as pd)
    x["above_5_days"] = diff.abs() > pd.to_timedelta(5, unit="D")
    # rows without a previous row in the group get a missing value instead of False
    x.loc[to_subtract_from.isna(), "above_5_days"] = None
    return x

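When rerunning this, you can see that the extra line right before the return statement sets the value in the new column to NaN wherever the shifted end date is missing. In the output below the column was also upcast to float, which is why True is displayed as 1.0; the resulting dtype may differ between pandas versions.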

newdf = id_grp.apply(calc_diff)
newdf
   ID      start        end  above_5_days
0   1 2020-12-13 2020-12-20           NaN
1   1 2020-12-26 2021-01-20           1.0
2   1 2020-02-20 2020-02-21           1.0
3   2 2020-12-13 2020-12-20           NaN
4   2 2021-01-11 2021-01-20           1.0
5   2 2021-02-15 2021-02-26           1.0
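As an alternative that is not part of the original answer, the same result can be sketched without groupby.apply by shifting the end column within each group directly. This version assumes pandas' nullable "boolean" dtype is acceptable for the result, so that missing values show as <NA> and True stays True rather than becoming 1.0.

```python
import pandas as pd

# Same sample data as in the question.
df = pd.DataFrame(
    {
        "ID": [1, 1, 1, 2, 2, 2],
        "start": pd.to_datetime(["2020-12-13", "2020-12-26", "2020-02-20",
                                 "2020-12-13", "2021-01-11", "2021-02-15"]),
        "end": pd.to_datetime(["2020-12-20", "2021-01-20", "2020-02-21",
                               "2020-12-20", "2021-01-20", "2021-02-26"]),
    }
)

# End date of the previous row within the same ID (NaT for the first row of each group).
prev_end = df.groupby("ID")["end"].shift(1)
diff = prev_end - df["start"]

# Nullable boolean result: True/False where a previous row exists, <NA> otherwise.
flag = (diff.abs() > pd.Timedelta(days=5)).astype("boolean")
flag[prev_end.isna()] = pd.NA
df["above_5_days"] = flag

print(df)
```

Because this avoids groupby.apply, it does not depend on how apply handles group keys, which has changed between pandas versions.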
