
Filtering Dataframe based on many conditions

Here is my problem:

I have a DataFrame that looks like this:

Date  Name  Score  Country
2012  Paul    45    Mexico
2012  Mike    38    Sweden
2012  Teddy   62    USA 
2012  Hilary  80    USA 
2013  Ashley  42    France 
2013  Temari  58    UK 
2013  Harry   78    UK
2013  Silvia  55    Italy

I want to select the two best scores for each date, with the constraint that the two must come from different countries.

For example, in 2012 Hilary (USA) has the best score, so she is selected. Teddy has the second-best score in 2012, but he is not selected because he comes from the same country (USA). Paul is selected instead, because he has the best score among those from a different country (Mexico).

This is what I did:

import pandas as pd

df = pd.DataFrame(
    {'Date':["2012","2012","2012","2012","2013","2013","2013","2013"],
     'Name': ["Paul", "Mike", "Teddy", "Hilary", "Ashley", "Temaru","Harry","Silvia"],
     'Score': [45, 38, 62, 80, 42, 58,78,55],
     "Country":["Mexico","Sweden","USA","USA","France","UK",'UK','Italy']})

I then filtered by Date and by Score:

df1 = df.set_index('Name').groupby('Date')['Score'].apply(lambda grp: grp.nlargest(2))

But I don't know how to add a filter that ensures the two selected people come from different countries.

Does anyone have an idea? Thank you so much.

EDIT: The answer I am looking for should look something like this:

Date  Name  Score  Country
2012  Hilary  80    USA 
2012  Paul    45    Mexico
2013  Harry   78    UK
2013  Silvia  55    Italy

Filter two people by date, best score, and from different countries.

sort_values + tail

s=df.sort_values('Score').drop_duplicates(['Date','Country'],keep='last').groupby('Date').tail(2)
s
   Date    Name  Score Country
0  2012    Paul     45  Mexico
7  2013  Silvia     55   Italy
6  2013   Harry     78      UK
3  2012  Hilary     80     USA
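
The one-liner above can be unpacked into its three steps. The following is only a sketch that restates the same chain with intermediate names (by_score, best_per_country, top_two, and result are illustrative names, not from the original answer), plus an optional final sort that is my assumption about how to match the row order requested in the question:

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["2012", "2012", "2012", "2012", "2013", "2013", "2013", "2013"],
    "Name": ["Paul", "Mike", "Teddy", "Hilary", "Ashley", "Temaru", "Harry", "Silvia"],
    "Score": [45, 38, 62, 80, 42, 58, 78, 55],
    "Country": ["Mexico", "Sweden", "USA", "USA", "France", "UK", "UK", "Italy"],
})

# Step 1: sort ascending by Score so the best row of every group comes last.
by_score = df.sort_values("Score")

# Step 2: keep only the last (highest-scoring) row per (Date, Country) pair.
best_per_country = by_score.drop_duplicates(["Date", "Country"], keep="last")

# Step 3: take the last two rows per Date, i.e. the two highest remaining scores.
top_two = best_per_country.groupby("Date").tail(2)

# Optional (assumption): order by Date, then by descending Score, for readability.
result = top_two.sort_values(["Date", "Score"], ascending=[True, False]).reset_index(drop=True)
print(result)
```

Because step 2 leaves at most one row per country within each date, the two rows kept in step 3 are necessarily from different countries.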

You can group by a list of columns using the code below:

df1 = df.set_index('Name').groupby(['Date', 'Country'])['Score'].apply(lambda grp: grp.nlargest(1))

It produces this output:

Date  Country  Name     Score
2012  Mexico   Paul      45
      Sweden   Mike      38
      USA      Hilary    80
2013  France   Ashley    42
      Italy    Silvia    55
      UK       Harry     78
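
The per-country result above still has three rows per date, so one more step is needed to reach the top two. A possible way to do that, sketched here under the assumption that scores within a (Date, Country) group have a single maximum, is to use idxmax to find the row label of each group's best score and then keep two rows per date (best_idx, best_rows, and top_two are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["2012", "2012", "2012", "2012", "2013", "2013", "2013", "2013"],
    "Name": ["Paul", "Mike", "Teddy", "Hilary", "Ashley", "Temaru", "Harry", "Silvia"],
    "Score": [45, 38, 62, 80, 42, 58, 78, 55],
    "Country": ["Mexico", "Sweden", "USA", "USA", "France", "UK", "UK", "Italy"],
})

# Row label of the highest Score within each (Date, Country) group.
best_idx = df.groupby(["Date", "Country"])["Score"].idxmax()

# Pull the full rows back out, so Name and Country stay ordinary columns.
best_rows = df.loc[best_idx]

# Within each Date, order by descending Score and keep the first two rows.
top_two = (
    best_rows.sort_values(["Date", "Score"], ascending=[True, False])
    .groupby("Date")
    .head(2)
    .reset_index(drop=True)
)
print(top_two)
```

Selecting rows by label keeps all original columns, so no merge is needed to recover the names.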

EDIT:

Based on the new information, here is a solution. It could probably be improved a bit, but it works.

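# Sort by Date, then by descending Score, in one call; two separate sorts
# would rely on the second sort being stable, which the default is not guaranteed to be
df.sort_values(['Date', 'Score'], ascending=[True, False], inplace=True)
# Keep only the best row for each (Date, Country) pair
df.drop_duplicates(['Date', 'Country'], keep='first', inplace=True)
# Take the two best remaining rows for each Date
df1 = df.groupby('Date').head(2).reset_index(drop=True)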

This outputs:

   Date    Name  Score Country
0  2012  Hilary     80     USA
1  2012    Paul     45  Mexico
2  2013   Harry     78      UK
3  2013  Silvia     55   Italy

# Keeps one row per Country: the first (Country, Name, Date) group in sorted order
df.groupby(['Country','Name','Date']).agg(Score=('Score','first')).reset_index().drop_duplicates(subset='Country', keep='first')


I used a different, longer approach that no one has submitted so far.

df = pd.DataFrame(
    {'Date':["2012","2012","2012","2012","2013","2013","2013","2013"],
     'Name': ["Paul", "Mike", "Teddy", "Hilary", "Ashley", "Temaru","Harry","Silvia"],
     'Score': [45, 38, 62, 80, 42, 58,78,55],
     "Country":["Mexico","Sweden","USA","USA","France","UK",'UK','Italy']})

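# Best score for each (Date, Country) pair
df1 = df.groupby(['Date', 'Country'])['Score'].max().reset_index()

# Merge back on all three columns to recover the matching Name;
# merging on Score alone would mismatch rows whenever two people share a score
df2 = df1.merge(df, on=['Date', 'Country', 'Score'])

# Keep the two highest-scoring rows for each Date
df2.sort_values(['Date', 'Score'], ascending=[True, False]).groupby('Date').head(2)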

This is a little convoluted, but it does the job.
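
One thing to watch with any merge-based approach like this is the choice of join keys. The sketch below adds a hypothetical extra row (Zoe, Spain, 2013, with the same score as Paul; this row is not in the original data) to show how joining on Score alone can pair a best score with the wrong person, while joining on Date, Country, and Score does not:

```python
import pandas as pd

df = pd.DataFrame({
    # The last entry in each list is a hypothetical row added for illustration.
    "Date": ["2012", "2012", "2012", "2012", "2013", "2013", "2013", "2013", "2013"],
    "Name": ["Paul", "Mike", "Teddy", "Hilary", "Ashley", "Temaru", "Harry", "Silvia", "Zoe"],
    "Score": [45, 38, 62, 80, 42, 58, 78, 55, 45],
    "Country": ["Mexico", "Sweden", "USA", "USA", "France", "UK", "UK", "Italy", "Spain"],
})

# Best score for each (Date, Country) pair: 7 rows with the extra row included.
best = df.groupby(["Date", "Country"])["Score"].max().reset_index()

# Joining on Score only: the two rows with Score 45 each match both Paul and Zoe.
on_score = best.merge(df[["Name", "Score"]], on="Score")

# Joining on all three columns: each best score matches exactly one person.
on_all_keys = best.merge(df, on=["Date", "Country", "Score"])

print(len(on_score), len(on_all_keys))
```

With the original eight rows every score is unique, so both joins give the same six rows; the difference only appears once scores repeat.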
