Filtering a DataFrame based on multiple conditions
Here is my problem:
I have a DataFrame that looks like this:
Date Name Score Country
2012 Paul 45 Mexico
2012 Mike 38 Sweden
2012 Teddy 62 USA
2012 Hilary 80 USA
2013 Ashley 42 France
2013 Temari 58 UK
2013 Harry 78 UK
2013 Silvia 55 Italy
I want to select the two best scores for each date, with the constraint that the two selected people come from different countries.
For example: in 2012 Hilary has the best score (USA), so she is selected. Teddy has the second-best score in 2012, but he is not selected because he comes from the same country (USA). Paul is selected instead, because he comes from a different country (Mexico).
This is what I did:
import pandas as pd

df = pd.DataFrame(
{'Date':["2012","2012","2012","2012","2013","2013","2013","2013"],
'Name': ["Paul", "Mike", "Teddy", "Hilary", "Ashley", "Temaru","Harry","Silvia"],
'Score': [45, 38, 62, 80, 42, 58,78,55],
"Country":["Mexico","Sweden","USA","USA","France","UK",'UK','Italy']})
And then I filtered by Date and by Score:
df1 = df.set_index('Name').groupby('Date')['Score'].apply(lambda grp: grp.nlargest(2))
But I don't really know how to write the filter so that it takes into account that the two people have to come from different countries.
Does anyone have an idea? Thank you so much.
EDIT: The answer I am looking for should be something like this:
Date Name Score Country
2012 Hilary 80 USA
2012 Paul 45 Mexico
2013 Harry 78 UK
2013 Silvia 55 Italy
Filter two people by date, best score and from a different country
sort_values + tail
s=df.sort_values('Score').drop_duplicates(['Date','Country'],keep='last').groupby('Date').tail(2)
s
Date Name Score Country
0 2012 Paul 45 Mexico
7 2013 Silvia 55 Italy
6 2013 Harry 78 UK
3 2012 Hilary 80 USA
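Note that the rows above keep the order of the ascending sort by Score, so they are not grouped by date. As a minimal sketch (assuming the same sample data as in the question), one extra sort_values plus reset_index on the result puts the rows in the order the question asks for:

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ["2012", "2012", "2012", "2012", "2013", "2013", "2013", "2013"],
    'Name': ["Paul", "Mike", "Teddy", "Hilary", "Ashley", "Temaru", "Harry", "Silvia"],
    'Score': [45, 38, 62, 80, 42, 58, 78, 55],
    'Country': ["Mexico", "Sweden", "USA", "USA", "France", "UK", "UK", "Italy"],
})

s = (
    df.sort_values('Score')                                      # ascending, so best scores come last
      .drop_duplicates(['Date', 'Country'], keep='last')         # best row per (Date, Country)
      .groupby('Date')
      .tail(2)                                                   # two highest remaining rows per date
      .sort_values(['Date', 'Score'], ascending=[True, False])   # reorder for display only
      .reset_index(drop=True)
)
print(s)
```

The final sort only changes the presentation; the selected rows are the same as in the one-liner above.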
You can group by a list of columns using the code below:
df1 = df.set_index('Name').groupby(['Date', 'Country'])['Score'].apply(lambda grp: grp.nlargest(1))
It outputs this:
Date  Country  Name    Score
2012  Mexico   Paul       45
      Sweden   Mike       38
      USA      Hilary     80
2013  France   Ashley     42
      Italy    Silvia     55
      UK       Harry      78
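The grouped result above holds the best score per country and date, but still more than two rows per date. A possible way to finish from there, sketched under the assumption that the sample data from the question is used (the names per_country and top2 are made up for illustration), is to flatten the index and keep the two highest rows per date:

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ["2012", "2012", "2012", "2012", "2013", "2013", "2013", "2013"],
    'Name': ["Paul", "Mike", "Teddy", "Hilary", "Ashley", "Temaru", "Harry", "Silvia"],
    'Score': [45, 38, 62, 80, 42, 58, 78, 55],
    'Country': ["Mexico", "Sweden", "USA", "USA", "France", "UK", "UK", "Italy"],
})

# Best score per (Date, Country), indexed by Date, Country and Name
per_country = (
    df.set_index('Name')
      .groupby(['Date', 'Country'])['Score']
      .apply(lambda grp: grp.nlargest(1))
)

# Flatten the index into columns, then keep the two best rows per date
top2 = (
    per_country.reset_index(name='Score')
               .sort_values(['Date', 'Score'], ascending=[True, False])
               .groupby('Date')
               .head(2)
               .reset_index(drop=True)
)
print(top2)
```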
EDIT:
Based on the new information, here is a solution. It could probably be improved a bit, but it works.
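# Sort by date, then by score descending within each date, in one call
# (two separate sorts rely on the second sort being stable, which the default is not guaranteed to be)
df.sort_values(['Date', 'Score'], ascending=[True, False], inplace=True)
# Keep only the best-scoring row for each (Date, Country) pair
df.drop_duplicates(['Date', 'Country'], keep='first', inplace=True)
# Take the top two remaining rows per date
df1 = df.groupby('Date').head(2).reset_index(drop=True)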
This outputs:
Date Name Score Country
0 2012 Hilary 80 USA
1 2012 Paul 45 Mexico
2 2013 Harry 78 UK
3 2013 Silvia 55 Italy
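If the same selection is needed more than once, the steps could be wrapped in a small helper that leaves the original DataFrame untouched. This is only a sketch: the function name top_n_distinct and its parameters are hypothetical, not part of pandas, and the data is the sample from the question.

```python
import pandas as pd


def top_n_distinct(frame, n=2, group_col='Date', distinct_col='Country', score_col='Score'):
    """Return the n best rows per group, with at most one row per distinct_col value."""
    # Order each group from best to worst score
    ordered = frame.sort_values([group_col, score_col], ascending=[True, False])
    # Keep the best row for every (group, distinct) pair
    unique = ordered.drop_duplicates([group_col, distinct_col], keep='first')
    # Take the first n rows of each group
    return unique.groupby(group_col).head(n).reset_index(drop=True)


df = pd.DataFrame({
    'Date': ["2012", "2012", "2012", "2012", "2013", "2013", "2013", "2013"],
    'Name': ["Paul", "Mike", "Teddy", "Hilary", "Ashley", "Temaru", "Harry", "Silvia"],
    'Score': [45, 38, 62, 80, 42, 58, 78, 55],
    'Country': ["Mexico", "Sweden", "USA", "USA", "France", "UK", "UK", "Italy"],
})

result = top_n_distinct(df, n=2)
print(result)
```

Because nothing is done in place, df still has all eight rows afterwards and the helper can be called again with a different n.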
I used a different, longer approach that no one has submitted so far.
df = pd.DataFrame(
{'Date':["2012","2012","2012","2012","2013","2013","2013","2013"],
'Name': ["Paul", "Mike", "Teddy", "Hilary", "Ashley", "Temaru","Harry","Silvia"],
'Score': [45, 38, 62, 80, 42, 58,78,55],
"Country":["Mexico","Sweden","USA","USA","France","UK",'UK','Italy']})
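# Best score for each (Date, Country) pair
df1 = df.groupby(['Date', 'Country'])['Score'].max().reset_index()
# Merge on all three keys to recover Name; merging on Score alone mismatches rows when scores repeat
df2 = df1.merge(df, on=['Date', 'Country', 'Score'])
# Keep the two best rows per date
df2.sort_values('Score', ascending=False).groupby('Date').head(2)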
This is a little convoluted, but it does the job.
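The merge step can probably be avoided by looking up the row label of each group's maximum with idxmax, which keeps all columns of the original rows. A minimal sketch, assuming the same sample data (best_rows and result are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ["2012", "2012", "2012", "2012", "2013", "2013", "2013", "2013"],
    'Name': ["Paul", "Mike", "Teddy", "Hilary", "Ashley", "Temaru", "Harry", "Silvia"],
    'Score': [45, 38, 62, 80, 42, 58, 78, 55],
    'Country': ["Mexico", "Sweden", "USA", "USA", "France", "UK", "UK", "Italy"],
})

# Row label of the highest score within each (Date, Country) group
best_idx = df.groupby(['Date', 'Country'])['Score'].idxmax()
best_rows = df.loc[best_idx]

# Two best of those rows per date
result = (
    best_rows.sort_values(['Date', 'Score'], ascending=[True, False])
             .groupby('Date')
             .head(2)
             .reset_index(drop=True)
)
print(result)
```

When two people in the same country and date tie for the best score, idxmax keeps only the first of them, whereas a merge on the maximum score would keep both.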