简体   繁体   中英

Groupby Pandas , calculate multiple columns based on date difference

I have a pandas dataframe shown below:

CID RefID   Date        Group   MID 
100     1   1/01/2021       A                       
100     2   3/01/2021       A                       
100     3   4/01/2021       A   101             
100     4   15/01/2021      A                           
100     5   18/01/2021      A                   
200     6   3/03/2021       B                       
200     7   4/04/2021       B                       
200     8   9/04/2021       B   102             
200     9   25/04/2021      B                       
300     10  26/04/2021      C                       
300     11  27/05/2021      C           
300     12  28/05/2021      C   103 

I want to create three columns:

days_diff:

  1. This has to be created in a way that if the difference b/w the first Date and corresponding rows is greater than 30 belonging to the same CID then assign 'NAT' or 0 to the next row (reset) and then subtract the date with this row for the following values

  2. If MIDis not null and belong to same CID group assign 'NAT' or 0 to the next row (reset) and then subtract the date with this row for the following values

Otherwise just fetch the date difference b/w the first row belonging to the same CID for the corresponding rows

A: This depends on the days_diff column , this column is like a counter it will only change/increment when there's another NAT occurrence for the same CID and reset itself for every CID.

B: This column depends on the column A , if the value in A remains same it won't change otherwise increments

It's a bit complicated to explain please refer to the output below for reference. I have used .groupby() .diff() and .shift() methods to create multiple dummy columns in order to calculate this and still working on it, please let me know the best way to go about this, thanks

My expected output :

CID RefID   Date        Group   MID     days_diff       A   B
100     1   1/01/2021       A           NAT             1   1
100     2   3/01/2021       A           2 days          1   1
100     3   4/01/2021       A   101     3 days          1   1
100     4   15/01/2021      A           NAT             2   4
100     5   18/01/2021      A           3 days          2   4
200     6   3/03/2021       B           NAT             1   6
200     7   4/04/2021       B           NAT             2   7
200     8   9/04/2021       B   102     5 days          2   7
200     9   25/04/2021      B           NAT             3   9
300     10  26/04/2021      C           NAT             1   10
300     11  27/05/2021      C           NAT             2   11
300     12  28/05/2021      C   103     1 day           2   11

You could do something like this:

def days_diff(sdf):
    result = pd.DataFrame(
        {"days_diff": pd.NaT, "A": None}, index=sdf.index
    )
    start = sdf.at[sdf.index[0], "Date"]
    for index, day, next_MID_is_na in zip(
        sdf.index[1:], sdf.Date[1:], sdf.MID.shift(1).isna()[1:]
    ):
        diff = (day - start).days
        if diff <= 30 and next_MID_is_na:
            result.at[index, "days_diff"] = diff
        else:
            start = day
    result.A = result.days_diff.isna().cumsum()
    return result

df[["days_diff", "A"]] = df[["CID", "Date", "MID"]].groupby("CID").apply(days_diff)
df["B"] = df.RefID.where(df.A != df.A.shift(1)).ffill()

Result for df created by

from io import StringIO
data = StringIO(
'''
CID RefID   Date        Group   MID 
100     1   1/01/2021       A                       
100     2   3/01/2021       A                       
100     3   4/01/2021       A   101             
100     4   15/01/2021      A                           
100     5   18/01/2021      A                   
200     6   3/03/2021       B                       
200     7   4/04/2021       B                       
200     8   9/04/2021       B   102             
200     9   25/04/2021      B                       
300     10  26/04/2021      C                       
300     11  27/05/2021      C           
300     12  28/05/2021      C   103
''')
df = pd.read_csv(data, delim_whitespace=True)
df.Date = pd.to_datetime(df.Date, format="%d/%m/%Y")

is

    CID  RefID       Date Group    MID days_diff  A     B
0   100      1 2021-01-01     A    NaN       NaT  1   1.0
1   100      2 2021-01-03     A    NaN         2  1   1.0
2   100      3 2021-01-04     A  101.0         3  1   1.0
3   100      4 2021-01-15     A    NaN       NaT  2   4.0
4   100      5 2021-01-18     A    NaN         3  2   4.0
5   200      6 2021-03-03     B    NaN       NaT  1   6.0
6   200      7 2021-04-04     B    NaN       NaT  2   7.0
7   200      8 2021-04-09     B  102.0         5  2   7.0
8   200      9 2021-04-25     B    NaN       NaT  3   9.0
9   300     10 2021-04-26     C    NaN       NaT  1  10.0
10  300     11 2021-05-27     C    NaN       NaT  2  11.0
11  300     12 2021-05-28     C  103.0         1  2  11.0

A few explanations:

  • The function days_diff produces a dataframe with the two columns days_diff and A . It is applied to the grouped by column CID sub-dataframes of df .
  • First step: Initializing the result dataframe result (column days_diff filled with NaT , column A with None ), and setting the starting value start for the day differences to the first day in the group.
  • Afterwards essentially looping over the sub-dataframe after the first index, thereby grabbing the index, the value in column Date , and a boolean value next_MID_is_na that signifies if the value of the MID column in the next row ist NaN (via .shift(1).isna() ).
  • In every step of the loop:
    1. Calculation of the difference of the current day to the start day.
    2. Checking the rules for the days_diff column:
      • If difference of current and start day <= 30 days and NaN in next MID -row -> day-difference.
      • Otherwise -> reset of start to the current day.
  • After finishing column days_diff calculation of column A : result.days_diff.isna() is True ( == 1 ) when days_diff is NaN , False ( == 0 ) otherwise. Therefore the cummulative sum ( .cumsum() ) gives the required result.
  • After the groupby-apply to produce the columns days_diff and A finally the calculation of column B : Selection of RefID -values where the values A change (via .where(df.A != df.A.shift(1)) ), and then forward filling the remaining NaN s.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM