Filter a DataFrame based on the highest and lowest row values and an increase over time

Question

I have the following dataframe of students with their exam scores on different dates (sorted):

import datetime
import pandas as pd

df = pd.DataFrame({'student': 'A A A A B B B C C D D'.split(),
                  'exam_date':[datetime.datetime(2013,4,1),datetime.datetime(2013,6,1),
                               datetime.datetime(2013,7,1),datetime.datetime(2013,9,2),
                               datetime.datetime(2013,10,1),datetime.datetime(2013,11,2),
                               datetime.datetime(2014,2,2),datetime.datetime(2014,5,2),
                               datetime.datetime(2014,6,2), datetime.datetime(2013,7,1),
                               datetime.datetime(2013,9,2),],
                   'score': [15, 22, 32, 20, 30, 38, 26, 18, 30, 33, 40]})

print(df)

   student  exam_date  score
0        A 2013-04-01     15
1        A 2013-06-01     22
2        A 2013-07-01     32
3        A 2013-09-02     20
4        B 2013-10-01     30
5        B 2013-11-02     38
6        B 2014-02-02     26
7        C 2014-05-02     18
8        C 2014-06-02     30
9        D 2013-07-01     33
10       D 2013-09-02     40

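I only need to keep the rows where the highest score is more than 10 above the lowest score, and drop everything else. The dates matter here too: the highest score must fall on a later date than the lowest score, not an earlier one.

For example, for student A, the lowest score is 15 and the score rises to 32 at a later date, so we keep it.

For student B, the lowest score is 26, but the score never increases after that. It only went down, so we drop it.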

For student D, the lowest score is 33 and the score rises to 40, an increase of only 7, so we drop it.

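I first tried df.groupby('student').agg({'score': np.ptp}), but it was hard to track whether the score went down or up.

Then I tried df.loc[df.groupby('student')['score'].idxmin()] and df.loc[df.groupby('student')['score'].idxmax()] to get the minimum and maximum rows, but I'm not sure how to compare the dates. Maybe I could merge them and then compare, but that seems like too much work.

Desired output: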

student exam_date   score
2   A   2013-07-01  32
8   C   2014-06-02  30

#--For A, highest score of 32 increased by 17 from lowest score of 15  
#--For C, highest score of 30 increased by 12 from lowest score of 18

What is the smartest way to do this? Any suggestions would be appreciated. Thanks!

Answer 1

In your case, first filter to the rows that come after each student's minimum, then take the maximum of what remains:

con1 = df.groupby('student')['score'].transform('idxmin')
out = df[df.index>con1].set_index('exam_date').groupby('student')['score'].agg(['idxmax','max'])

out
Out[65]: 
            idxmax  max
student                
A       2013-07-01   32
C       2014-06-02   30
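
The snippet above keeps the rows after each student's minimum, but as written it does not seem to apply the "more than 10" threshold, so student D (33 to 40) would presumably pass as well. Below is a minimal sketch of one way to add that check, using the merge idea the question mentions. The helper name `filter_gain`, the `threshold` parameter, and the `low_date` / `low_score` columns are assumptions made for illustration, not part of the original answer.

```python
import datetime
import pandas as pd

df = pd.DataFrame({
    'student': 'A A A A B B B C C D D'.split(),
    'exam_date': [datetime.datetime(2013, 4, 1), datetime.datetime(2013, 6, 1),
                  datetime.datetime(2013, 7, 1), datetime.datetime(2013, 9, 2),
                  datetime.datetime(2013, 10, 1), datetime.datetime(2013, 11, 2),
                  datetime.datetime(2014, 2, 2), datetime.datetime(2014, 5, 2),
                  datetime.datetime(2014, 6, 2), datetime.datetime(2013, 7, 1),
                  datetime.datetime(2013, 9, 2)],
    'score': [15, 22, 32, 20, 30, 38, 26, 18, 30, 33, 40],
})


def filter_gain(df, threshold=10):
    """Hypothetical helper: keep each student's best row after their lowest score."""
    # One row per student holding the date and value of the lowest score.
    low = (df.loc[df.groupby('student')['score'].idxmin(),
                  ['student', 'exam_date', 'score']]
             .rename(columns={'exam_date': 'low_date', 'score': 'low_score'}))

    # Attach the low point to every row while preserving the original index.
    merged = (df.reset_index()
                .merge(low, on='student')
                .set_index('index')
                .rename_axis(None))

    # Only exams taken strictly after the low point can count as an increase.
    later = merged[merged['exam_date'] > merged['low_date']]

    # Best later score per student, kept only if it beats the low by more than threshold.
    best = later.loc[later.groupby('student')['score'].idxmax()]
    keep = best['score'] - best['low_score'] > threshold
    return best.loc[keep, ['student', 'exam_date', 'score']]


out = filter_gain(df)
print(out)
```

Comparing `exam_date` values rather than index positions means the sketch does not rely on the frame being sorted, at the cost of an extra merge.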

Answer 2

Assuming your dataframe is already sorted by date:

highest_score = lambda x: x['score'].cummax() * (x['score'] > x['score'].shift()) \
                          - (x['score'].cummin()) >= 10

out = df[df.groupby('student').apply(highest_score).droplevel(0)]
print(out)

# Output:
  student  exam_date  score
2       A 2013-07-01     32
8       C 2014-06-02     30

The expression * (x['score'] > x['score'].shift()) stops cummax from propagating when the next value is lower than the current maximum.
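
To see what that mask does, here is a small sketch that spells out the intermediate columns for student A's scores only. The column names (`rose`, `masked_max`, `gain`, `keep`) are made up for this illustration; the arithmetic is the same as in the lambda above.

```python
import pandas as pd

# Student A's scores, already in date order.
score = pd.Series([15, 22, 32, 20], name='score')

steps = pd.DataFrame({
    'score': score,
    'cummax': score.cummax(),         # running maximum: 15, 22, 32, 32
    'rose': score > score.shift(),    # True only where the score went up
    'cummin': score.cummin(),         # running minimum: 15, 15, 15, 15
})

# Multiplying by the boolean mask zeroes the running max on rows that did not rise,
# so the final 20 does not inherit the earlier 32.
steps['masked_max'] = steps['cummax'] * steps['rose']
steps['gain'] = steps['masked_max'] - steps['cummin']
steps['keep'] = steps['gain'] >= 10

print(steps)
```

Only the row with score 32 ends up with `keep` set to True, which matches row 2 in the output above. Note that the comparison is `>= 10`, so a gain of exactly 10 is kept, whereas the question asks for more than 10.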

Answer 3

The question is a bit confusing, but this works for your sample data:

subset = df.loc[
    df.groupby('student')
      .apply(lambda x: x['score'].idxmax()
             if x.sort_values('exam_date')['score'].diff().max() >= 10 else None)
      .dropna()
      .astype(int)
]

Output:

>>> subset
  student  exam_date  score
2       A 2013-07-01     32
8       C 2014-06-02     30

Answer 1
0 2021-12-21 22:29:11

Answer 2
0 2021-12-21 22:40:32

Answer 3
0 2021-12-21 22:49:21
