
Calculate a column's value based on another column in each group of a pandas groupby

Below is a sample data frame:

import pandas as pd

df = pd.DataFrame({'StudentName': ['Anil','Ramu','Ramu','Anil','Peter','Peter','Anil','Ramu','Peter','Anil'],
                   'ExamDate': ['2021-01-10','2021-01-20','2021-02-22','2021-03-30','2021-01-04','2021-06-06','2021-04-30','2021-07-30','2021-07-08','2021-09-07'],
                   'Result': ['Fail','Pass','Fail','Pass','Pass','Pass','Pass','Pass','Fail','Pass']})


  StudentName    ExamDate Result
0        Anil  2021-01-10   Fail
1        Ramu  2021-01-20   Pass
2        Ramu  2021-02-22   Fail
3        Anil  2021-03-30   Pass
4       Peter  2021-01-04   Pass
5       Peter  2021-06-06   Pass
6        Anil  2021-04-30   Pass
7        Ramu  2021-07-30   Pass
8       Peter  2021-07-08   Fail
9        Anil  2021-09-07   Pass

For each row, I would like to calculate the number of days since that student's last failed test:

df = pd.DataFrame({'StudentName': ['Anil','Ramu','Ramu','Anil','Peter','Peter','Anil','Ramu','Peter','Anil'],
                   'ExamDate': ['2021-01-10','2021-01-20','2021-02-22','2021-03-30','2021-01-04','2021-06-06','2021-04-30','2021-07-30','2021-07-08','2021-09-07'],
                   'Result': ['Fail','Pass','Fail','Pass','Pass','Pass','Pass','Pass','Fail','Pass'],
                   'LastFailedDays': [0, 0, 0, 79, 0, 0, 110, 158, 0, 240]})


  StudentName    ExamDate Result  LastFailedDays
0        Anil  2021-01-10   Fail               0
1        Ramu  2021-01-20   Pass               0
2        Ramu  2021-02-22   Fail               0
3        Anil  2021-03-30   Pass              79
4       Peter  2021-01-04   Pass               0
5       Peter  2021-06-06   Pass               0
6        Anil  2021-04-30   Pass             110
7        Ramu  2021-07-30   Pass             158
8       Peter  2021-07-08   Fail               0
9        Anil  2021-09-07   Pass             240

For example:

  • Anil failed on 2021-01-10, so that row is zero days.
  • Anil's next record, a pass, is on 2021-03-30, so that row holds the number of days from his previous failed date (2021-01-10) to 2021-03-30, which is 79 days.
  • Anil's third record, also a pass, is on 2021-04-30, so that row again counts the days from 2021-01-10 (his last failed date) to 2021-04-30, which is 110 days.

It is doable with regular loops, but I am looking for a more idiomatic pandas solution. I'm guessing it's possible with groupby.
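For reference, the loop version mentioned in the question might look like the sketch below. It is only a baseline for checking vectorized answers against; the function name last_failed_days_loop is made up for illustration, and the sketch assumes the frame has a unique index.

```python
import pandas as pd

df = pd.DataFrame({
    'StudentName': ['Anil', 'Ramu', 'Ramu', 'Anil', 'Peter',
                    'Peter', 'Anil', 'Ramu', 'Peter', 'Anil'],
    'ExamDate': ['2021-01-10', '2021-01-20', '2021-02-22', '2021-03-30', '2021-01-04',
                 '2021-06-06', '2021-04-30', '2021-07-30', '2021-07-08', '2021-09-07'],
    'Result': ['Fail', 'Pass', 'Fail', 'Pass', 'Pass',
               'Pass', 'Pass', 'Pass', 'Fail', 'Pass'],
})

def last_failed_days_loop(frame):
    """Hypothetical helper: days since each student's most recent 'Fail' (0 if none yet)."""
    dates = pd.to_datetime(frame['ExamDate'])
    out = pd.Series(0, index=frame.index)
    last_fail = {}  # student name -> date of that student's most recent failed exam
    # Visit rows in chronological order so "most recent" is well defined.
    for idx in dates.sort_values(kind='mergesort').index:
        name = frame.at[idx, 'StudentName']
        if frame.at[idx, 'Result'] == 'Fail':
            last_fail[name] = dates.at[idx]
        if name in last_fail:
            out.at[idx] = (dates.at[idx] - last_fail[name]).days
    return out

loop_days = last_failed_days_loop(df)
print(loop_days.tolist())
```

On the sample data this reproduces the LastFailedDays column shown in the expected output above.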

I've finally come up with a solution that works.

# Process the data a bit
df['Tmp_Result'] = df['Result'].map({'Pass': 1, 'Fail': 0})
df['ExamDate'] = pd.to_datetime(df['ExamDate'])

# Create a mask that will be used to group the rows by StudentName + consecutive passed tests after a failed test (including the failed test)
sorted_df = df.sort_values(['StudentName', 'ExamDate']) 
mask = sorted_df.groupby('StudentName')['Tmp_Result'].diff().ne(0).cumsum()
mask[(sorted_df['Tmp_Result'].eq(0) & ~(pd.isna(sorted_df.groupby('StudentName')['Tmp_Result'].shift(-1))))] += 1

df['LastFailedDays'] = df.groupby(mask)['ExamDate'].diff().fillna(pd.Timedelta(0))
df['LastFailedDays'] = df.groupby(mask)['LastFailedDays'].cumsum()

# Cleanup
df = df.drop('Tmp_Result', axis=1)

Output:

>>> df
  StudentName   ExamDate Result LastFailedDays
0        Anil 2021-01-10   Fail         0 days
1        Ramu 2021-01-20   Pass         0 days
2        Ramu 2021-02-22   Fail         0 days
3        Anil 2021-03-30   Pass        79 days
4       Peter 2021-01-04   Pass         0 days
5       Peter 2021-06-06   Pass       153 days
6        Anil 2021-04-30   Pass       110 days
7        Ramu 2021-07-30   Pass       158 days
8       Peter 2021-07-08   Fail         0 days
9        Anil 2021-09-07   Pass       240 days

>>> df.sort_values(['StudentName', 'ExamDate'])
  StudentName   ExamDate Result LastFailedDays
0        Anil 2021-01-10   Fail         0 days
3        Anil 2021-03-30   Pass        79 days
6        Anil 2021-04-30   Pass       110 days
9        Anil 2021-09-07   Pass       240 days
4       Peter 2021-01-04   Pass         0 days
5       Peter 2021-06-06   Pass       153 days
8       Peter 2021-07-08   Fail         0 days
1        Ramu 2021-01-20   Pass         0 days
2        Ramu 2021-02-22   Fail         0 days
7        Ramu 2021-07-30   Pass       158 days

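It's a bit hard on the eyes, but because it's vectorized, it should be much faster than a loop-based solution on large frames. Note that Peter's 2021-06-06 row comes out as 153 days rather than the 0 in the expected output, because his two passes before any failed test land in the same group.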
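In the output above, Peter's 2021-06-06 row shows 153 days, while the question expects 0 because Peter had not failed yet. One possible variation, sketched here under the assumption that each student's rows already appear in chronological order, is to group by a running count of fails per student and zero out the rows that come before the first fail. The names fail_no, block_start and days_since_fail are illustrative, not part of the answer's code.

```python
import pandas as pd

df = pd.DataFrame({
    'StudentName': ['Anil', 'Ramu', 'Ramu', 'Anil', 'Peter',
                    'Peter', 'Anil', 'Ramu', 'Peter', 'Anil'],
    'ExamDate': ['2021-01-10', '2021-01-20', '2021-02-22', '2021-03-30', '2021-01-04',
                 '2021-06-06', '2021-04-30', '2021-07-30', '2021-07-08', '2021-09-07'],
    'Result': ['Fail', 'Pass', 'Fail', 'Pass', 'Pass',
               'Pass', 'Pass', 'Pass', 'Fail', 'Pass'],
})
df['ExamDate'] = pd.to_datetime(df['ExamDate'])

is_fail = df['Result'].eq('Fail')
# Running count of fails per student: 0 before the first fail, then 1, 2, ...
fail_no = is_fail.astype(int).groupby(df['StudentName']).cumsum().rename('FailNo')
# Within each (student, fail_no) block the first row is the failed exam itself
# (or the student's first row when fail_no == 0).
block_start = df.groupby([df['StudentName'], fail_no])['ExamDate'].transform('first')
days_since_fail = (df['ExamDate'] - block_start).dt.days
# Rows before a student's first fail have no reference date; report 0 as the question does.
days_since_fail = days_since_fail.where(fail_no > 0, 0)

print(days_since_fail.tolist())
```

This avoids the temporary Tmp_Result column and the in-place mask adjustment, at the cost of relying on row order for the cumulative count.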

TL;DR

Use Series.where and groupby.ffill to generate the last failed dates, then subtract them from ExamDate to get LastFailedDays:

df['ExamDate'] = pd.to_datetime(df['ExamDate'])
last_failed_date = (df['ExamDate'].where(df['Result'] == 'Fail')
                                  .groupby(df['StudentName']).ffill())
df['LastFailedDays'] = df['ExamDate'].sub(last_failed_date).dt.days.fillna(0)

#   StudentName    ExamDate  Result  LastFailedDays
# 0        Anil  2021-01-10    Fail             0.0
# 1        Ramu  2021-01-20    Pass             0.0
# 2        Ramu  2021-02-22    Fail             0.0
# 3        Anil  2021-03-30    Pass            79.0
# 4       Peter  2021-01-04    Pass             0.0
# 5       Peter  2021-06-06    Pass             0.0
# 6        Anil  2021-04-30    Pass           110.0
# 7        Ramu  2021-07-30    Pass           158.0
# 8       Peter  2021-07-08    Fail             0.0
# 9        Anil  2021-09-07    Pass           240.0
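Two details of the snippet above are worth spelling out: the forward fill follows row order, so it assumes each student's exams are listed chronologically, and dt.days followed by fillna(0) yields floats. A minimal sketch that sorts first and returns integers could look like this; the wrapper name last_failed_days is hypothetical, and it assumes the frame has a unique index.

```python
import pandas as pd

def last_failed_days(frame):
    """Hypothetical wrapper: days since each student's most recent 'Fail' (0 if none yet)."""
    dates = pd.to_datetime(frame['ExamDate'])
    # Sort chronologically so the forward fill only ever looks backwards in time.
    order = dates.sort_values(kind='mergesort').index
    dates = dates.loc[order]
    names = frame['StudentName'].loc[order]
    result = frame['Result'].loc[order]
    # Keep the date on failed rows only, then carry it forward within each student.
    last_fail = dates.where(result.eq('Fail')).groupby(names).ffill()
    days = dates.sub(last_fail).dt.days.fillna(0).astype(int)
    return days.reindex(frame.index)  # restore the caller's row order

df = pd.DataFrame({
    'StudentName': ['Anil', 'Ramu', 'Ramu', 'Anil', 'Peter',
                    'Peter', 'Anil', 'Ramu', 'Peter', 'Anil'],
    'ExamDate': ['2021-01-10', '2021-01-20', '2021-02-22', '2021-03-30', '2021-01-04',
                 '2021-06-06', '2021-04-30', '2021-07-30', '2021-07-08', '2021-09-07'],
    'Result': ['Fail', 'Pass', 'Fail', 'Pass', 'Pass',
               'Pass', 'Pass', 'Pass', 'Fail', 'Pass'],
})

days = last_failed_days(df)
# The same values come back even when the rows arrive in reverse order.
days_reversed = last_failed_days(df.iloc[::-1])
```

Sorting inside the helper and reindexing at the end keeps the caller's frame untouched.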

Details

  1. Convert ExamDate with pd.to_datetime:

     df['ExamDate'] = pd.to_datetime(df['ExamDate'])
  2. Use Series.where to keep ExamDate only on failed rows, leaving NaT elsewhere (here I've made it a column for easier visualization):

     df['LastFailedDate'] = df['ExamDate'].where(df['Result'] == 'Fail')

     #   StudentName    ExamDate  Result  LastFailedDate
     # 0        Anil  2021-01-10    Fail      2021-01-10
     # 1        Ramu  2021-01-20    Pass             NaT
     # 2        Ramu  2021-02-22    Fail      2021-02-22
     # 3        Anil  2021-03-30    Pass             NaT
     # 4       Peter  2021-01-04    Pass             NaT
     # 5       Peter  2021-06-06    Pass             NaT
     # 6        Anil  2021-04-30    Pass             NaT
     # 7        Ramu  2021-07-30    Pass             NaT
     # 8       Peter  2021-07-08    Fail      2021-07-08
     # 9        Anil  2021-09-07    Pass             NaT
  3. Use groupby.ffill to forward-fill the last failed dates per student:

     df['LastFailedDate'] = df['LastFailedDate'].groupby(df['StudentName']).ffill()

     #   StudentName    ExamDate  Result  LastFailedDate
     # 0        Anil  2021-01-10    Fail      2021-01-10
     # 1        Ramu  2021-01-20    Pass             NaT
     # 2        Ramu  2021-02-22    Fail      2021-02-22
     # 3        Anil  2021-03-30    Pass      2021-01-10
     # 4       Peter  2021-01-04    Pass             NaT
     # 5       Peter  2021-06-06    Pass             NaT
     # 6        Anil  2021-04-30    Pass      2021-01-10
     # 7        Ramu  2021-07-30    Pass      2021-02-22
     # 8       Peter  2021-07-08    Fail      2021-07-08
     # 9        Anil  2021-09-07    Pass      2021-01-10
  4. Finally, subtract the last failed dates from the exam dates and use dt.days to extract the number of days (rows with no earlier fail are NaN and get filled with 0):

     df['LastFailedDays'] = df['ExamDate'].sub(df['LastFailedDate']).dt.days.fillna(0)

     #   StudentName    ExamDate  Result  LastFailedDate  LastFailedDays
     # 0        Anil  2021-01-10    Fail      2021-01-10             0.0
     # 1        Ramu  2021-01-20    Pass             NaT             0.0
     # 2        Ramu  2021-02-22    Fail      2021-02-22             0.0
     # 3        Anil  2021-03-30    Pass      2021-01-10            79.0
     # 4       Peter  2021-01-04    Pass             NaT             0.0
     # 5       Peter  2021-06-06    Pass             NaT             0.0
     # 6        Anil  2021-04-30    Pass      2021-01-10           110.0
     # 7        Ramu  2021-07-30    Pass      2021-02-22           158.0
     # 8       Peter  2021-07-08    Fail      2021-07-08             0.0
     # 9        Anil  2021-09-07    Pass      2021-01-10           240.0
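To see why the forward fill in step 3 is done per student, it can help to compare it with a plain ffill on the same column. The sketch below rebuilds the sample frame from the question; the variable names fail_dates, plain and per_student are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'StudentName': ['Anil', 'Ramu', 'Ramu', 'Anil', 'Peter',
                    'Peter', 'Anil', 'Ramu', 'Peter', 'Anil'],
    'ExamDate': ['2021-01-10', '2021-01-20', '2021-02-22', '2021-03-30', '2021-01-04',
                 '2021-06-06', '2021-04-30', '2021-07-30', '2021-07-08', '2021-09-07'],
    'Result': ['Fail', 'Pass', 'Fail', 'Pass', 'Pass',
               'Pass', 'Pass', 'Pass', 'Fail', 'Pass'],
})
df['ExamDate'] = pd.to_datetime(df['ExamDate'])

# Dates on failed rows only, NaT everywhere else (step 2).
fail_dates = df['ExamDate'].where(df['Result'] == 'Fail')

plain = fail_dates.ffill()                                   # ignores student boundaries
per_student = fail_dates.groupby(df['StudentName']).ffill()  # fills within each student

# Row 1 is Ramu's first exam, a pass. The plain fill borrows Anil's failed
# date from row 0, while the grouped fill leaves the row as NaT.
print(plain.loc[1], per_student.loc[1])
```

With the plain fill, Ramu's first row would be measured against another student's failed date, which is why the grouped version is used.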


