Groupby Pandas, calculate multiple columns based on date difference
I have a pandas dataframe shown below:
CID  RefID  Date        Group  MID
100  1      1/01/2021   A
100  2      3/01/2021   A
100  3      4/01/2021   A      101
100  4      15/01/2021  A
100  5      18/01/2021  A
200  6      3/03/2021   B
200  7      4/04/2021   B
200  8      9/04/2021   B      102
200  9      25/04/2021  B
300  10     26/04/2021  C
300  11     27/05/2021  C
300  12     28/05/2021  C      103
I want to create three columns:
days_diff:
If, within the same CID, the difference between the first Date and a row's Date is greater than 30 days, assign 'NAT' or 0 to that row (reset), and measure the following rows from this row's Date.
If MID is not null within the same CID group, assign 'NAT' or 0 to the next row (reset), and measure the following rows from that row's Date.
Otherwise, just take the difference between the row's Date and the first Date belonging to the same CID.
A: This depends on the days_diff column. It works like a counter: it only increments when there is another NAT occurrence for the same CID, and it resets itself for every CID.
B: This depends on column A. If the value in A remains the same, B doesn't change; otherwise it increments.
It's a bit complicated to explain, so please refer to the output below. I have used the .groupby(), .diff() and .shift() methods to create multiple dummy columns in order to calculate this, and I am still working on it. Please let me know the best way to go about this, thanks.
My expected output:
CID  RefID  Date        Group  MID  days_diff  A  B
100  1      1/01/2021   A           NAT        1  1
100  2      3/01/2021   A           2 days     1  1
100  3      4/01/2021   A      101  3 days     1  1
100  4      15/01/2021  A           NAT        2  4
100  5      18/01/2021  A           3 days     2  4
200  6      3/03/2021   B           NAT        1  6
200  7      4/04/2021   B           NAT        2  7
200  8      9/04/2021   B      102  5 days     2  7
200  9      25/04/2021  B           NAT        3  9
300  10     26/04/2021  C           NAT        1  10
300  11     27/05/2021  C           NAT        2  11
300  12     28/05/2021  C      103  1 day      2  11
You could do something like this:
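import pandas as pd


def days_diff(sdf):
    # dtype=object so the column can hold NaT and integer day counts side by side.
    result = pd.DataFrame(
        {"days_diff": pd.NaT, "A": None}, index=sdf.index, dtype=object
    )
    # Reference date for the differences: the first date of the CID group.
    start = sdf.at[sdf.index[0], "Date"]
    # Despite its name, next_MID_is_na checks the MID of the preceding row
    # (shift(1)), so the row right after a non-null MID triggers a reset.
    for index, day, next_MID_is_na in zip(
        sdf.index[1:], sdf.Date[1:], sdf.MID.shift(1).isna()[1:]
    ):
        diff = (day - start).days
        if diff <= 30 and next_MID_is_na:
            result.at[index, "days_diff"] = diff
        else:
            # Reset: this row keeps NaT and becomes the new reference date.
            start = day
    # Every NaT marks a reset, so the running count of NaTs is the counter A.
    result.A = result.days_diff.isna().cumsum()
    return result


# group_keys=False keeps the original row index on the result, which the
# column assignment relies on (newer pandas versions prepend CID otherwise).
df[["days_diff", "A"]] = (
    df[["CID", "Date", "MID"]].groupby("CID", group_keys=False).apply(days_diff)
)
df["B"] = df.RefID.where(df.A != df.A.shift(1)).ffill()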
Result for df created by
from io import StringIO

import pandas as pd

data = StringIO(
'''
CID RefID Date Group MID
100 1 1/01/2021 A
100 2 3/01/2021 A
100 3 4/01/2021 A 101
100 4 15/01/2021 A
100 5 18/01/2021 A
200 6 3/03/2021 B
200 7 4/04/2021 B
200 8 9/04/2021 B 102
200 9 25/04/2021 B
300 10 26/04/2021 C
300 11 27/05/2021 C
300 12 28/05/2021 C 103
''')
# sep=r"\s+" splits on runs of whitespace (same effect as delim_whitespace=True);
# rows without a MID value get NaN in that column.
df = pd.read_csv(data, sep=r"\s+")
df.Date = pd.to_datetime(df.Date, format="%d/%m/%Y")
is
CID RefID Date Group MID days_diff A B
0 100 1 2021-01-01 A NaN NaT 1 1.0
1 100 2 2021-01-03 A NaN 2 1 1.0
2 100 3 2021-01-04 A 101.0 3 1 1.0
3 100 4 2021-01-15 A NaN NaT 2 4.0
4 100 5 2021-01-18 A NaN 3 2 4.0
5 200 6 2021-03-03 B NaN NaT 1 6.0
6 200 7 2021-04-04 B NaN NaT 2 7.0
7 200 8 2021-04-09 B 102.0 5 2 7.0
8 200 9 2021-04-25 B NaN NaT 3 9.0
9 300 10 2021-04-26 C NaN NaT 1 10.0
10 300 11 2021-05-27 C NaN NaT 2 11.0
11 300 12 2021-05-28 C 103.0 1 2 11.0
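Note that days_diff in this result holds plain integers next to NaT, while the expected output in the question shows values such as "2 days". If real Timedelta values are wanted, one possible variant is sketched below. The helper name days_diff_td is hypothetical, the frame is rebuilt from the question's data so the snippet runs on its own, and iterating over the groups with pd.concat is my own choice to sidestep version differences in groupby-apply, not something the approach requires.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "CID": [100] * 5 + [200] * 4 + [300] * 3,
        "RefID": list(range(1, 13)),
        "Date": pd.to_datetime(
            ["1/01/2021", "3/01/2021", "4/01/2021", "15/01/2021", "18/01/2021",
             "3/03/2021", "4/04/2021", "9/04/2021", "25/04/2021",
             "26/04/2021", "27/05/2021", "28/05/2021"],
            format="%d/%m/%Y",
        ),
        "MID": [None, None, 101, None, None, None, None, 102, None,
                None, None, 103],
    }
)


def days_diff_td(sdf):
    """Hypothetical variant of days_diff that keeps Timedelta values."""
    diffs = [pd.NaT]  # the first row of a group is always a reset
    start = sdf["Date"].iloc[0]
    prev_mid_is_na = sdf["MID"].shift(1).isna()
    for day, mid_na in zip(sdf["Date"].iloc[1:], prev_mid_is_na.iloc[1:]):
        diff = day - start
        if diff <= pd.Timedelta(days=30) and mid_na:
            diffs.append(diff)
        else:
            # Reset: leave NaT here and measure later rows from this date.
            diffs.append(pd.NaT)
            start = day
    result = pd.DataFrame(
        {"days_diff": pd.Series(diffs, index=sdf.index, dtype="timedelta64[ns]")}
    )
    # Running count of resets within the group.
    result["A"] = result["days_diff"].isna().cumsum()
    return result


# Process each CID group separately and stitch the pieces back together;
# the original row index is preserved, so join aligns row by row.
parts = [days_diff_td(group) for _, group in df.groupby("CID")]
df = df.join(pd.concat(parts))
print(df)
```

The column B can then be derived from A exactly as before.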
A few explanations:
The function days_diff produces a dataframe with the two columns days_diff and A. It is applied to the sub-dataframes of df grouped by column CID.
First step: initializing the result dataframe (column days_diff filled with NaT, column A with None), and setting the starting value start for the day differences to the first day in the group.
Then looping over the rows of the sub-dataframe after the first one, grabbing the index, the value of Date, and a boolean value next_MID_is_na that signifies whether the value of the MID column in the neighbouring row is NaN (via .shift(1).isna(), which looks at the preceding row).
In every step of the loop, the difference between the current day and start is calculated, and the rules for the days_diff column are checked:
If the difference is at most 30 days and MID is NaN in the preceding row -> day-difference.
Otherwise -> days_diff stays NaT, and start is reset to the current day.
After the loop, the calculation of column A: result.days_diff.isna() is True ( == 1 ) when days_diff is missing (NaT), False ( == 0 ) otherwise. Therefore the cumulative sum ( .cumsum() ) gives the required result.
After the groupby-apply to produce the columns days_diff and A, finally the calculation of column B: selection of the RefID values where the values of A change (via .where(df.A != df.A.shift(1)) ), and then forward filling the remaining NaNs.
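The two idioms behind A and B can be tried in isolation. The sketch below uses a small hand-made frame (the values are illustrative, not taken from the run above). It also shows an optional guard, which is an assumption on my part and not something the question asks for: comparing CID as well as A, so that B still advances when one CID ends on the same counter value the next CID starts with. That case does not occur in the question's data.

```python
import pandas as pd

# Hand-made toy frame: both CIDs consist of a single block,
# so the counter A is 1 on every row.
toy = pd.DataFrame(
    {
        "CID": [100, 100, 100, 200, 200],
        "RefID": [1, 2, 3, 4, 5],
        "days_diff": [None, 2, 3, None, 4],
    }
)

# A: running count of missing days_diff values, restarted for each CID.
toy["A"] = toy["days_diff"].isna().groupby(toy["CID"]).cumsum()

# B as described above: keep RefID where A changes, then forward fill.
toy["B"] = toy["RefID"].where(toy["A"] != toy["A"].shift(1)).ffill()

# Optional guard (assumption): a change of CID also starts a new block.
new_block = (toy["A"] != toy["A"].shift(1)) | (toy["CID"] != toy["CID"].shift(1))
toy["B_guarded"] = toy["RefID"].where(new_block).ffill()

print(toy)
```

Here B stays at 1 for all five rows because A never changes, while B_guarded switches to 4 at the first row of CID 200.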