How to calculate the time difference, grouped by ID, between the min date and the date where values changed
For each unique ID, I want to calculate the time difference (in days) between its initial date (min(DATE)) and the date where its C1 is greater than its initial C1 OR its C2 is less than its initial C2. I want to skip IDs that have only one record and IDs whose values don't change.
ID DATE C1 C2
AACH 2022-06-10 05:00:00+00:00 70 2
AAHA 2022-01-12 06:00:00+00:00 60 6
AAHA 2022-04-07 05:00:00+00:00 60 4
AAHA 2022-05-20 05:00:00+00:00 60 5
AALU 2021-09-10 05:00:00+00:00 70 0
AALU 2021-11-29 06:00:00+00:00 70 4
AALU 2022-05-17 05:00:00+00:00 60 5
ABAL 2021-10-11 05:00:00+00:00 60 0
ABAL 2022-03-17 05:00:00+00:00 80 4
ABAN 2021-05-24 05:00:00+00:00 60 3
ABAN 2021-06-24 05:00:00+00:00 70 2
ABAN 2021-08-10 05:00:00+00:00 60 3
ABAN 2022-01-14 06:00:00+00:00 70 2
ABAN 2022-03-18 05:00:00+00:00 60 5
ABAN 2022-04-21 05:00:00+00:00 70 2
My expected output is:
ID Time Difference(Days) Date of value changed
AAHA 2022-01-12 06:00:00+00:00
AALU 2021-09-10 05:00:00+00:00
ABAL 2021-10-11 05:00:00+00:00
ABAN 2021-05-24 05:00:00+00:00
Use the DataFrame's sort_values and iloc:
import io
import pandas as pd

txt="""ID,DATE,C1,C2
AACH,2022-06-10 05:00:00+00:00,70,2
AAHA,2022-01-12 06:00:00+00:00,60,6
AAHA,2022-04-07 05:00:00+00:00,60,4
AAHA,2022-05-20 05:00:00+00:00,60,5
AALU,2021-09-10 05:00:00+00:00,70,0
AALU,2021-11-29 06:00:00+00:00,70,4
AALU,2022-05-17 05:00:00+00:00,60,5
ABAL,2021-10-11 05:00:00+00:00,60,0
ABAL,2022-03-17 05:00:00+00:00,80,4
ABAN,2021-05-24 05:00:00+00:00,60,3
ABAN,2021-06-24 05:00:00+00:00,70,2
ABAN,2021-08-10 05:00:00+00:00,60,3
ABAN,2022-01-14 06:00:00+00:00,70,2
ABAN,2022-03-18 05:00:00+00:00,60,5
ABAN,2022-04-21 05:00:00+00:00,70,2"""
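df = pd.read_csv(io.StringIO(txt), sep=',')
print(df.columns)
df['DATE'] = pd.to_datetime(df['DATE'])
# Sort so the first row of each ID is its earliest date, then reset the
# index so that labels and positions agree
df = df.sort_values(by=['ID', 'DATE']).reset_index(drop=True)
old_ID = ""
for index, row in df.iterrows():
    if row['ID'] != old_ID:
        # First row of a new ID: remember its initial C1 and C2
        min_c1 = row['C1']
        min_c2 = row['C2']
    # Flag rows where C1 rose above, or C2 fell below, the initial value
    df.loc[index, 'C1_GT_initC1'] = row['C1'] > min_c1
    df.loc[index, 'C2_LT_initC2'] = row['C2'] < min_c2
    old_ID = row['ID']
print(df)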
Output:
ID DATE C1 C2 C1_GT_initC1 C2_LT_initC2
0 AACH 2022-06-10 05:00:00+00:00 70 2 False False
1 AAHA 2022-01-12 06:00:00+00:00 60 6 False False
2 AAHA 2022-04-07 05:00:00+00:00 60 4 False True
3 AAHA 2022-05-20 05:00:00+00:00 60 5 False True
4 AALU 2021-09-10 05:00:00+00:00 70 0 False False
5 AALU 2021-11-29 06:00:00+00:00 70 4 False False
6 AALU 2022-05-17 05:00:00+00:00 60 5 False False
7 ABAL 2021-10-11 05:00:00+00:00 60 0 False False
8 ABAL 2022-03-17 05:00:00+00:00 80 4 True False
9 ABAN 2021-05-24 05:00:00+00:00 60 3 False False
10 ABAN 2021-06-24 05:00:00+00:00 70 2 True True
11 ABAN 2021-08-10 05:00:00+00:00 60 3 False False
12 ABAN 2022-01-14 06:00:00+00:00 70 2 True True
13 ABAN 2022-03-18 05:00:00+00:00 60 5 False False
14 ABAN 2022-04-21 05:00:00+00:00 70 2 True True
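
The flags above stop short of the time difference the question asks for. A minimal sketch of one way to finish the job is below. It is not the answerer's code: it swaps the row loop for groupby/transform, and the names first_change, INIT_DATE, CHANGE_DATE and DAYS are made up for illustration. It assumes "the date where values changed" means the first row per ID that meets either condition.

```python
import io
import pandas as pd

txt = """ID,DATE,C1,C2
AACH,2022-06-10 05:00:00+00:00,70,2
AAHA,2022-01-12 06:00:00+00:00,60,6
AAHA,2022-04-07 05:00:00+00:00,60,4
AAHA,2022-05-20 05:00:00+00:00,60,5
AALU,2021-09-10 05:00:00+00:00,70,0
AALU,2021-11-29 06:00:00+00:00,70,4
AALU,2022-05-17 05:00:00+00:00,60,5
ABAL,2021-10-11 05:00:00+00:00,60,0
ABAL,2022-03-17 05:00:00+00:00,80,4
ABAN,2021-05-24 05:00:00+00:00,60,3
ABAN,2021-06-24 05:00:00+00:00,70,2
ABAN,2021-08-10 05:00:00+00:00,60,3
ABAN,2022-01-14 06:00:00+00:00,70,2
ABAN,2022-03-18 05:00:00+00:00,60,5
ABAN,2022-04-21 05:00:00+00:00,70,2"""

df = pd.read_csv(io.StringIO(txt))
df['DATE'] = pd.to_datetime(df['DATE'], utc=True)


def first_change(df):
    """Hypothetical helper: one row per ID whose C1 rose or C2 fell."""
    df = df.sort_values(['ID', 'DATE']).reset_index(drop=True)
    grouped = df.groupby('ID')
    # Broadcast each ID's initial date, C1 and C2 onto all of its rows
    init_date = grouped['DATE'].transform('min')
    init_c1 = grouped['C1'].transform('first')
    init_c2 = grouped['C2'].transform('first')
    changed = (df['C1'] > init_c1) | (df['C2'] < init_c2)
    # Keep only changed rows; IDs with one record or no change drop out
    hits = df.assign(INIT_DATE=init_date)[changed]
    out = hits.groupby('ID', as_index=False).first()
    out = out.rename(columns={'DATE': 'CHANGE_DATE'})
    # .dt.days keeps whole days only (84 days 23 hours becomes 84)
    out['DAYS'] = (out['CHANGE_DATE'] - out['INIT_DATE']).dt.days
    return out[['ID', 'DAYS', 'INIT_DATE', 'CHANGE_DATE']]


result = first_change(df)
print(result)
```

On the sample data this keeps AAHA, ABAL and ABAN. AACH has a single record, and AALU never meets either condition (its C1 only drops and its C2 only rises), so both are skipped. If calendar-day differences are wanted instead of elapsed whole days, normalizing both dates with .dt.normalize() before subtracting is one option.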