
Groupby Pandas, calculate multiple columns based on date difference

I have a pandas dataframe shown below:

CID RefID   Date        Group   MID 
100     1   1/01/2021       A                       
100     2   3/01/2021       A                       
100     3   4/01/2021       A   101             
100     4   15/01/2021      A                           
100     5   18/01/2021      A                   
200     6   3/03/2021       B                       
200     7   4/04/2021       B                       
200     8   9/04/2021       B   102             
200     9   25/04/2021      B                       
300     10  26/04/2021      C                       
300     11  27/05/2021      C           
300     12  28/05/2021      C   103 

I want to create three columns:

days_diff:

  1. This has to be created so that, within the same CID, if the difference between the first Date and a corresponding row is greater than 30 days, 'NAT' or 0 is assigned to the next row (reset), and the following values are computed by subtracting that row's date.

  2. If MID is not null within the same CID group, assign 'NAT' or 0 to the next row (reset) and then subtract that row's date for the following values.

Otherwise, just take the date difference between each row and the first row belonging to the same CID.

A: This depends on the days_diff column. It works like a counter: it only increments when there is another NAT occurrence for the same CID, and it resets for every CID.

B: This depends on column A. If the value in A stays the same, B does not change; otherwise it changes (in the expected output it takes the RefID of the row where A changes).

It's a bit complicated to explain, so please refer to the output below. I have used the .groupby(), .diff() and .shift() methods to create multiple dummy columns in order to calculate this and am still working on it. Please let me know the best way to go about this, thanks.

My expected output:

CID RefID   Date        Group   MID     days_diff       A   B
100     1   1/01/2021       A           NAT             1   1
100     2   3/01/2021       A           2 days          1   1
100     3   4/01/2021       A   101     3 days          1   1
100     4   15/01/2021      A           NAT             2   4
100     5   18/01/2021      A           3 days          2   4
200     6   3/03/2021       B           NAT             1   6
200     7   4/04/2021       B           NAT             2   7
200     8   9/04/2021       B   102     5 days          2   7
200     9   25/04/2021      B           NAT             3   9
300     10  26/04/2021      C           NAT             1   10
300     11  27/05/2021      C           NAT             2   11
300     12  28/05/2021      C   103     1 day           2   11

You could do something like this:

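import pandas as pd

def days_diff(sdf):
    # One result row per row of the CID group; object dtype lets the
    # column hold both NaT markers and integer day counts.
    result = pd.DataFrame(
        {"days_diff": pd.NaT, "A": None}, index=sdf.index, dtype=object
    )
    # Reference date for the current run of rows
    start = sdf.at[sdf.index[0], "Date"]
    # Note: despite its name, next_MID_is_na is True when the MID of the
    # *previous* row is NaN (shift(1) moves values one row down).
    for index, day, next_MID_is_na in zip(
        sdf.index[1:], sdf.Date.iloc[1:], sdf.MID.shift(1).isna().iloc[1:]
    ):
        diff = (day - start).days
        if diff <= 30 and next_MID_is_na:
            result.at[index, "days_diff"] = diff
        else:
            # Reset: this row keeps NaT and becomes the new reference date
            start = day
    # Each NaT starts a new run, so the running count of NaT is counter A
    result.A = result.days_diff.isna().cumsum()
    return result

# group_keys=False keeps the original row index so the assignment aligns
df[["days_diff", "A"]] = (
    df[["CID", "Date", "MID"]].groupby("CID", group_keys=False).apply(days_diff)
)
# B: RefID of the row where A changes, carried forward over the run
df["B"] = df.RefID.where(df.A != df.A.shift(1)).ffill()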
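To see why the reset lands on the row after a non-null MID, it may help to look at what .shift(1).isna() produces on its own. The sketch below uses the MID values of CID 100 from the question; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# MID values of CID 100 from the question (only the third row is filled)
mid = pd.Series([np.nan, np.nan, 101.0, np.nan, np.nan])

# shift(1) moves every value one row down, so each row sees the MID of
# the row before it; the first row has no predecessor and becomes NaN
prev_mid_is_na = mid.shift(1).isna()

print(prev_mid_is_na.tolist())  # [True, True, True, False, True]
```

Only the fourth row is False, which is the row right after MID 101, so that is the row where the start date is reset.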

Result for df created by

from io import StringIO

import pandas as pd

data = StringIO(
'''
CID RefID   Date        Group   MID 
100     1   1/01/2021       A                       
100     2   3/01/2021       A                       
100     3   4/01/2021       A   101             
100     4   15/01/2021      A                           
100     5   18/01/2021      A                   
200     6   3/03/2021       B                       
200     7   4/04/2021       B                       
200     8   9/04/2021       B   102             
200     9   25/04/2021      B                       
300     10  26/04/2021      C                       
300     11  27/05/2021      C           
300     12  28/05/2021      C   103
''')
# sep=r"\s+" splits on runs of whitespace (delim_whitespace is deprecated in recent pandas)
df = pd.read_csv(data, sep=r"\s+")
df.Date = pd.to_datetime(df.Date, format="%d/%m/%Y")

is

    CID  RefID       Date Group    MID days_diff  A     B
0   100      1 2021-01-01     A    NaN       NaT  1   1.0
1   100      2 2021-01-03     A    NaN         2  1   1.0
2   100      3 2021-01-04     A  101.0         3  1   1.0
3   100      4 2021-01-15     A    NaN       NaT  2   4.0
4   100      5 2021-01-18     A    NaN         3  2   4.0
5   200      6 2021-03-03     B    NaN       NaT  1   6.0
6   200      7 2021-04-04     B    NaN       NaT  2   7.0
7   200      8 2021-04-09     B  102.0         5  2   7.0
8   200      9 2021-04-25     B    NaN       NaT  3   9.0
9   300     10 2021-04-26     C    NaN       NaT  1  10.0
10  300     11 2021-05-27     C    NaN       NaT  2  11.0
11  300     12 2021-05-28     C  103.0         1  2  11.0
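The question's expected output shows durations such as "2 days", while the column above holds plain integers. If actual timedeltas are preferred, one possible conversion is sketched below. The sample values mirror the days_diff column of CID 100, and the name as_timedelta is made up for this illustration.

```python
import pandas as pd

# days_diff values of CID 100 as produced above (NaT mixed with integers)
days = pd.Series([pd.NaT, 2, 3, pd.NaT, 3], dtype=object)

# Build Timedelta objects explicitly and keep NaT where a reset happened
as_timedelta = pd.Series(
    [pd.NaT if pd.isna(d) else pd.Timedelta(days=int(d)) for d in days],
    index=days.index,
    dtype="timedelta64[ns]",
)

print(as_timedelta)
```

The same expression could be applied to df["days_diff"] after the groupby-apply step, assuming the column still holds the NaT/integer mix shown in the table.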

A few explanations:

  • The function days_diff produces a dataframe with the two columns days_diff and A. It is applied to the sub-dataframes of df grouped by column CID.
  • First step: initialize the result dataframe result (column days_diff filled with NaT, column A with None), and set the starting value start for the day differences to the first day in the group.
  • Afterwards, loop over the sub-dataframe from the second row onwards, grabbing the index, the value in column Date, and a boolean next_MID_is_na. Despite its name, this flag (built via .shift(1).isna() ) tells whether the MID value in the previous row is NaN, so a non-null MID triggers the reset on the row that follows it.
  • In every step of the loop:
    1. Calculate the difference between the current day and the start day.
    2. Check the rules for the days_diff column:
      • If the difference between the current and the start day is <= 30 days and the MID of the previous row is NaN -> day difference.
      • Otherwise -> reset start to the current day.
  • After the days_diff column is finished, compute column A: result.days_diff.isna() is True ( == 1 ) when days_diff is NaT, False ( == 0 ) otherwise. Therefore the cumulative sum ( .cumsum() ) gives the required result.
  • After the groupby-apply has produced the columns days_diff and A, finally compute column B: select the RefID values where the value of A changes (via .where(df.A != df.A.shift(1)) ), and then forward fill the remaining NaNs.
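The two counter idioms from the last bullets can be tried in isolation. The sketch below also illustrates a possible edge case that the sample data does not exercise: if one CID ends with A == 1 (no reset) and the next CID starts with A == 1, comparing A with its shifted self sees no change at the group boundary. Treating a change in CID as a boundary as well is one way around this. The frame demo and the column names B_plain and B_by_group are assumptions made up for this example, not part of the original answer.

```python
import pandas as pd

demo = pd.DataFrame(
    {
        "CID": [100, 100, 200, 200],
        "RefID": [1, 2, 3, 4],
        # NaT marks the start of a run; neither group has a later reset
        "days_diff": pd.Series([pd.NaT, 2, pd.NaT, 5], dtype=object),
    }
)

# Counter A: running count of NaT markers, restarted for every CID
demo["A"] = demo["days_diff"].isna().astype(int).groupby(demo["CID"]).cumsum()

# Idiom from the answer: keep RefID where A changes, forward-fill the rest
a_changed = demo["A"] != demo["A"].shift(1)
demo["B_plain"] = demo["RefID"].where(a_changed).ffill()

# Variant (assumption): a new CID also counts as a change
cid_changed = demo["CID"] != demo["CID"].shift(1)
demo["B_by_group"] = demo["RefID"].where(a_changed | cid_changed).ffill()

print(demo[["CID", "A", "B_plain", "B_by_group"]])
```

Here B_plain carries RefID 1 into CID 200, while B_by_group restarts at RefID 3. In the question's data every CID contains at least one reset, so both variants agree there.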
