Filter Dataframe Based on Highest and Lowest Row Values with Increasing Timeline

I have the following dataframe of students with their exam scores on different dates (sorted):

import datetime
import pandas as pd

df = pd.DataFrame({'student': 'A A A A B B B C C D D'.split(),
                   'exam_date': [datetime.datetime(2013,4,1), datetime.datetime(2013,6,1),
                                 datetime.datetime(2013,7,1), datetime.datetime(2013,9,2),
                                 datetime.datetime(2013,10,1), datetime.datetime(2013,11,2),
                                 datetime.datetime(2014,2,2), datetime.datetime(2014,5,2),
                                 datetime.datetime(2014,6,2), datetime.datetime(2013,7,1),
                                 datetime.datetime(2013,9,2)],
                   'score': [15, 22, 32, 20, 30, 38, 26, 18, 30, 33, 40]})

print(df)

   student  exam_date  score
0        A 2013-04-01     15
1        A 2013-06-01     22
2        A 2013-07-01     32
3        A 2013-09-02     20
4        B 2013-10-01     30
5        B 2013-11-02     38
6        B 2014-02-02     26
7        C 2014-05-02     18
8        C 2014-06-02     30
9        D 2013-07-01     33
10       D 2013-09-02     40

I need to keep only those rows where the highest score is more than 10 above the lowest score, and drop the rest. The date is also important here: the highest score has to fall on a later date than the lowest score.

For example, for student A, the lowest score is 15 and the score later increases to 32, so we keep that row.

For student B, the lowest score is 26, but no score increases after that. It only decreases, so we drop B.

For student D, the lowest score is 33 and the score increases to 40, an increase of only 7, so we drop D as well.

I first tried df.groupby('student').agg({'score': np.ptp}), but it was hard to tell whether the score had decreased or increased.

Then I tried to use df.loc[df.groupby('student')['score'].idxmin()] and df.loc[df.groupby('student')['score'].idxmax()] to get the min and max values, but I'm not sure how I would compare the dates. Maybe I could merge them and then compare, but that seems like too much work.
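For reference, the merge-and-compare idea described above might look roughly like the sketch below. The names `grouped`, `low`, `high`, `pairs` and `keep` are illustrative and not from the original post, and the sketch assumes each student's overall maximum is the score of interest.

```python
import pandas as pd

df = pd.DataFrame({
    'student': 'A A A A B B B C C D D'.split(),
    'exam_date': pd.to_datetime([
        '2013-04-01', '2013-06-01', '2013-07-01', '2013-09-02',
        '2013-10-01', '2013-11-02', '2014-02-02',
        '2014-05-02', '2014-06-02',
        '2013-07-01', '2013-09-02',
    ]),
    'score': [15, 22, 32, 20, 30, 38, 26, 18, 30, 33, 40],
})

grouped = df.groupby('student')['score']
# One row per student: the row holding the lowest / highest score.
low = df.loc[grouped.idxmin()].set_index('student')
high = df.loc[grouped.idxmax()].set_index('student')

# Put both rows side by side so dates and scores can be compared directly.
pairs = low.join(high, lsuffix='_low', rsuffix='_high')
keep = pairs[
    (pairs['exam_date_high'] > pairs['exam_date_low'])      # max comes after min
    & (pairs['score_high'] - pairs['score_low'] > 10)       # rise of more than 10
]
print(keep[['exam_date_high', 'score_high']])
```

Because it only looks at the overall maximum, this sketch would miss a student whose overall maximum comes before the minimum but who still gains more than 10 afterwards.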

Desired output:

  student  exam_date  score
2       A 2013-07-01     32
8       C 2014-06-02     30

#--For A, highest score of 32 increased by 17 from lowest score of 15  
#--For C, highest score of 30 increased by 12 from lowest score of 18 

What would be the smartest way of doing it? Any suggestions would be appreciated. Thanks!

So in your case, first filter by the min point:

con1 = df.groupby('student')['score'].transform('idxmin')
out = df[df.index>con1].set_index('exam_date').groupby('student')['score'].agg(['idxmax','max'])

out
Out[65]: 
            idxmax  max
student                
A       2013-07-01   32
C       2014-06-02   30
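The snippet above finds the highest score after each student's minimum, but as written it does not apply the "more than 10" threshold, so a student like D (33 rising to 40) would still pass the filter. A minimal sketch of one way to finish the idea follows. It assumes the frame is sorted by date within each student and keeps its default integer index, and the names `min_idx`, `min_score`, `after_min`, `best_idx` and `result` are illustrative, not from the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'student': 'A A A A B B B C C D D'.split(),
    'exam_date': pd.to_datetime([
        '2013-04-01', '2013-06-01', '2013-07-01', '2013-09-02',
        '2013-10-01', '2013-11-02', '2014-02-02',
        '2014-05-02', '2014-06-02',
        '2013-07-01', '2013-09-02',
    ]),
    'score': [15, 22, 32, 20, 30, 38, 26, 18, 30, 33, 40],
})

scores = df.groupby('student')['score']
min_idx = scores.transform('idxmin')    # index label of each student's lowest score
min_score = scores.transform('min')     # that lowest score, repeated on every row

# Keep only rows that come strictly after the student's minimum.
after_min = df[df.index.to_numpy() > min_idx.to_numpy()]

# Highest score after the minimum, one index label per remaining student.
best_idx = after_min.groupby('student')['score'].idxmax()
candidates = df.loc[best_idx]

# Apply the "more than 10" rule from the question.
result = candidates[candidates['score'] - min_score.loc[best_idx] > 10]
print(result)
```

Student B drops out at the `after_min` step because no exam follows the minimum, and D drops out at the threshold step.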

Assuming your dataframe is already sorted by date:

highest_score = lambda x: x['score'].cummax() * (x['score'] > x['score'].shift()) \
                          - (x['score'].cummin()) >= 10

out = df[df.groupby('student').apply(highest_score).droplevel(0)]
print(out)

# Output:
  student  exam_date  score
2       A 2013-07-01     32
8       C 2014-06-02     30

The expression * (x['score'] > x['score'].shift()) zeroes out cummax on every row where the score did not rise from the previous exam, so a running maximum carried over from an earlier row is not counted again when the next value is lower.
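To make the masking step concrete, here is a small sketch that spells out the intermediate values for student A only. The column names in `steps` are illustrative and not part of the answer.

```python
import pandas as pd

# Student A's scores from the question, in exam-date order.
score = pd.Series([15, 22, 32, 20])

rose = score > score.shift()    # True where the score beat the previous exam
steps = pd.DataFrame({
    'score': score,
    'cummax': score.cummax(),
    'rose': rose,
    'masked_max': score.cummax() * rose,   # 0 wherever the score did not rise
    'cummin': score.cummin(),
})
steps['gain'] = steps['masked_max'] - steps['cummin']
steps['keep'] = steps['gain'] >= 10
print(steps)
```

Only the third exam (score 32) ends up with `keep` set to True. Note that `>= 10` also admits a gain of exactly 10, whereas the question asks for more than 10; using `> 10` would match that wording and gives the same rows on the sample data.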

This question is somewhat confusing, but this works for your sample data:

subset = df.loc[
    df.groupby('student')
      .apply(lambda x: x['score'].idxmax()
             if x.sort_values('exam_date')['score'].diff().max() >= 10
             else None)
      .dropna()
      .astype(int)
]

Output:

>>> subset
  student  exam_date  score
2       A 2013-07-01     32
8       C 2014-06-02     30
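One caveat, offered as a sketch rather than a correction: `diff()` compares consecutive exams only, so a gain spread over several smaller steps is not detected. The scores below are hypothetical and not from the question.

```python
import pandas as pd

# Hypothetical student whose score climbs by 12 in two steps of 6.
scores = pd.Series([15, 21, 27])

largest_step = scores.diff().max()                     # biggest exam-to-exam jump
largest_rise = (scores - scores.cummin()).max()        # biggest rise above any earlier low

print(largest_step, largest_rise)
```

Here the largest single step is 6 while the rise from the lowest score is 12, so a check based on `diff().max() >= 10` would drop this student even though the overall increase exceeds 10.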
