
How to filter a groupby dataframe based on its value counts

I have a grouped dataframe and I would like to return the 3 groups with the highest value counts.

For example, for the dataframe below, the expected output table should contain groups 20, 30 and 33.

I wanted to post the raw dataset as a table, but the grouped output did not render properly on SO, so I uploaded an image instead.

              amount  cosine_group
cosine_group
0              952.5             0
4             3000.0             4
20            2000.0            20
              2000.0            20
              2000.0            20
27            2000.0            27
30            2100.0            30
              2100.0            30
              2100.0            30
33            1065.0            33
              1065.0            33
              1065.0            33
              1065.0            33

Expected output:

              amount  cosine_group
cosine_group
20            2000.0            20
              2000.0            20
              2000.0            20
30            2100.0            30
              2100.0            30
              2100.0            30
33            1065.0            33
              1065.0            33
              1065.0            33
              1065.0            33


You can use .nlargest(3) to select the 3 largest group sizes. Use .isin() to match the rows whose cosine_group is one of those groups. Finally, use .loc to return the matching rows from the original dataframe, as follows:

df = df.rename_axis(index='cosine_group0')   # rename the index axis so it no longer clashes with the 'cosine_group' column
df.loc[df['cosine_group'].isin(df.groupby('cosine_group', as_index=False)['cosine_group'].size().nlargest(3, 'size')['cosine_group'].tolist())]

Or use:

df = df.rename_axis(index='cosine_group0')   # rename the index axis (it shares its name with the column)
df.loc[df["cosine_group"].isin(df["cosine_group"].value_counts().nlargest(3).index)]
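For anyone who wants to try this without the original data, here is a minimal sketch that rebuilds the table from the question by hand and applies the value_counts() variant. The construction of df (an index named cosine_group next to a column of the same name) is an assumption based on how the table is printed, and the names groups, amounts, top_groups and result are only illustrative.

```python
import pandas as pd

# Sample data retyped from the question's table (assumed layout).
groups = [0, 4, 20, 20, 20, 27, 30, 30, 30, 33, 33, 33, 33]
amounts = [952.5, 3000.0, 2000.0, 2000.0, 2000.0, 2000.0,
           2100.0, 2100.0, 2100.0, 1065.0, 1065.0, 1065.0, 1065.0]
df = pd.DataFrame(
    {"amount": amounts, "cosine_group": groups},
    index=pd.Index(groups, name="cosine_group"),
)

# Rename the index axis so it no longer shares a name with the column.
df = df.rename_axis(index="cosine_group0")

# Count rows per group and keep the labels of the 3 most frequent groups.
top_groups = df["cosine_group"].value_counts().nlargest(3).index

# Keep only the rows that belong to those groups.
result = df.loc[df["cosine_group"].isin(top_groups)]
print(sorted(result["cosine_group"].unique().tolist()))  # [20, 30, 33]
```

With this sample the result has 10 rows, which matches the expected output in the question.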

This may not be very Pythonic, but it definitely gets the job done.

# retrieve the index of the value counts (value_counts sorts by count, descending)
cosine_group_value = df["cosine_group"].value_counts().index

# get the first 3 values from the value counts (the 3 most frequent groups)
top3 = list(cosine_group_value)[:3]

# filter your dataframe using the top 3 values on the cosine_group column
df = df[df["cosine_group"].isin(top3)]
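One thing to watch for with either answer is ties: if several groups share the third-largest count, taking exactly 3 labels drops some of them arbitrarily. The sketch below uses made-up data (not the question's) to show how the keep="all" option of Series.nlargest could be used to retain every tied group; whether that is the desired behaviour depends on your use case.

```python
import pandas as pd

# Hypothetical data: "a" appears 3 times; "b", "c" and "d" appear twice each.
s = pd.Series(["a", "a", "a", "b", "b", "c", "c", "d", "d"], name="cosine_group")
counts = s.value_counts()

# Default keep="first": exactly 3 labels, so one of the tied groups is dropped.
top_first = counts.nlargest(3).index

# keep="all": every group tied with the 3rd-largest count is retained.
top_all = counts.nlargest(3, keep="all").index

print(len(top_first), len(top_all))
```

Here top_first holds three labels while top_all holds all four, because "b", "c" and "d" are tied for the last place.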
