I have a data frame that looks like this.
ID | Start | End |
---|---|---|
1 | 2020-12-13 | 2020-12-20 |
1 | 2020-12-26 | 2021-01-20 |
1 | 2020-02-20 | 2020-02-21 |
2 | 2020-12-13 | 2020-12-20 |
2 | 2021-01-11 | 2021-01-20 |
2 | 2021-02-15 | 2021-02-26 |
Using pandas, I am trying to group by ID and then subtract the start date from a current row from the end date of the previous row.
If the difference is greater than 5 then it should return True
I'm new to pandas, and I've been trying to figure this out all day.
Two assumptions:
So I am starting with this dataframe to which I added the column 'above_5_days'.
df
ID start end above_5_days
0 1 2020-12-13 2020-12-20 None
1 1 2020-12-26 2021-01-20 None
2 1 2020-02-20 2020-02-21 None
3 2 2020-12-13 2020-12-20 None
4 2 2021-01-11 2021-01-20 None
5 2 2021-02-15 2021-02-26 None
this will be the groupby object that will be used to apply the operation on each ID-group
id_grp = df.groupby("ID")
the following is the operation that will be applied on each subset
def calc_diff(x):
# this shifts the end times down by one row to align the current start with the previous end
to_subtract_from = x["end"].shift(periods=1)
diff = to_subtract_from - x["start"] # subtract the start date from the previous end
# sets the new column to True/False depending on condition
# if you don't want the absolute difference, remove .abs()
x["above_5_days"] = diff.abs() > to_timedelta(5, unit="D")
return x
Now apply this to the whole group and store it in a newdf
newdf = id_grp.apply(calc_diff)
newdf
ID start end above_5_days
0 1 2020-12-13 2020-12-20 False
1 1 2020-12-26 2021-01-20 True
2 1 2020-02-20 2020-02-21 True
3 2 2020-12-13 2020-12-20 False
4 2 2021-01-11 2021-01-20 True
5 2 2021-02-15 2021-02-26 True
>>>>>>> I should point out that:
in this case, there are only False values because shifting down the end column for each group will make a NaN value in the first row of the column, which returns a NaN value when subtracted from. So the False values are just the boolean versions of None.
That is why, I would personally change the function to:
def calc_diff(x):
# this shifts the end times down by one row to align the current start with the previous end
to_subtract_from = x["end"].shift(periods=1)
diff = to_subtract_from - x["start"] # subtract the start date from the previous end
# sets the new column to True/False depending on condition
x["above_5_days"] = diff.abs() > to_timedelta(5, unit="D")
x.loc[to_subtract_from.isna(), "above_5_days"] = None
return x
When rerunning this, you can see that the extra line right before the return statement will set the value in the new column to NaN if the shifted end times are NaN.
newdf = id_grp.apply(calc_diff)
newdf
ID start end above_5_days
0 1 2020-12-13 2020-12-20 NaN
1 1 2020-12-26 2021-01-20 1.0
2 1 2020-02-20 2020-02-21 1.0
3 2 2020-12-13 2020-12-20 NaN
4 2 2021-01-11 2021-01-20 1.0
5 2 2021-02-15 2021-02-26 1.0
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.