How to filter groups based on a dataframe's value counts

Question

I have a grouped dataframe and I would like to return the top 3 groups with the highest value counts.

For example, for the dataframe below, the expected output table should contain groups 20, 30 and 33.

I wanted to include the raw dataset as a table, but the grouped output did not render properly on SO, which is why I uploaded an image.

              amount  cosine_group
cosine_group
0              952.5             0
4             3000.0             4
20            2000.0            20
              2000.0            20
              2000.0            20
27            2000.0            27
30            2100.0            30
              2100.0            30
              2100.0            30
33            1065.0            33
              1065.0            33
              1065.0            33
              1065.0            33

Expected output:

              amount  cosine_group
cosine_group
20            2000.0            20
              2000.0            20
              2000.0            20
30            2100.0            30
              2100.0            30
              2100.0            30
33            1065.0            33
              1065.0            33
              1065.0            33
              1065.0            33

Answer 1

You can use .nlargest(3) to select the 3 largest group sizes, then use .isin() to match the rows whose cosine_group value belongs to one of those groups. Finally, use .loc to return the matching rows of the original dataframe, as follows:

df = df.rename_axis(index='cosine_group0')   # rename the index axis so its name no longer clashes with the 'cosine_group' column
df.loc[df['cosine_group'].isin(df.groupby('cosine_group', as_index=False)['cosine_group'].size().nlargest(3, 'size')['cosine_group'].tolist())]

Or use:

df = df.rename_axis(index='cosine_group0')   # rename the index axis so its name no longer clashes with the 'cosine_group' column
df.loc[df["cosine_group"].isin(df["cosine_group"].value_counts().nlargest(3).index)]
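For anyone who wants to try this without the original data, here is a minimal self-contained sketch. The dataframe is rebuilt by hand from the table in the question (the index is set to the same values and name as the column to mimic the grouped layout, which is an assumption about how the original frame was produced), and the names top_groups and result are only illustrative.

```python
import pandas as pd

# Sample data typed in from the table in the question.
df = pd.DataFrame({
    "amount": [952.5, 3000.0, 2000.0, 2000.0, 2000.0, 2000.0,
               2100.0, 2100.0, 2100.0, 1065.0, 1065.0, 1065.0, 1065.0],
    "cosine_group": [0, 4, 20, 20, 20, 27, 30, 30, 30, 33, 33, 33, 33],
})

# Assumption: the index repeats the column values and shares its name,
# as in the question's printout.
df.index = pd.Index(df["cosine_group"].to_numpy(), name="cosine_group")

# Rename the index axis so 'cosine_group' refers only to the column.
df = df.rename_axis(index="cosine_group0")

# Count rows per group, keep the labels of the 3 most frequent groups,
# then select every row that belongs to one of them.
top_groups = df["cosine_group"].value_counts().nlargest(3).index
result = df.loc[df["cosine_group"].isin(top_groups)]

print(result)
```

With this sample, the rows kept are those of groups 20, 30 and 33, in their original order.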

Answer 2

This may not be very Pythonic, but it definitely gets the work done.

# retrieve the index of the value counts (group labels, sorted by count in descending order)
cosine_group_value = df["cosine_group"].value_counts().index

# get the first 3 values from the value counts (the 3 most frequent groups)
top3 = list(cosine_group_value)[:3]

# filter your dataframe using the top 3 values on the cosine_group column
df = df[df["cosine_group"].isin(top3)]
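Both answers pick exactly 3 labels, so if several groups tie for third place, which of them survives depends on ordering. As a hedged sketch of one way to handle that, the helper below (filter_top_groups is a hypothetical name, not part of pandas) wraps the same value_counts/isin idea and passes keep="all" to Series.nlargest to retain every group tied with the last selected one.

```python
import pandas as pd

def filter_top_groups(frame, column, n=3, keep="first"):
    """Return the rows of `frame` whose `column` value is among the n most frequent.

    Hypothetical helper for illustration; keep="all" retains every group
    tied with the n-th largest count.
    """
    counts = frame[column].value_counts()
    top_labels = counts.nlargest(n, keep=keep).index
    return frame[frame[column].isin(top_labels)]

# Group labels taken from the question; a default integer index is assumed here.
df = pd.DataFrame(
    {"cosine_group": [0, 4, 20, 20, 20, 27, 30, 30, 30, 33, 33, 33, 33]}
)

# Groups 20 and 30 both have 3 rows, so they tie for second place.
top2 = filter_top_groups(df, "cosine_group", n=2)
top2_with_ties = filter_top_groups(df, "cosine_group", n=2, keep="all")

print(top2["cosine_group"].unique())
print(top2_with_ties["cosine_group"].unique())
```

With n=2, the default keeps group 33 plus only one of the tied groups, while keep="all" returns groups 20, 30 and 33.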

Answer 1
3 2021-06-02 09:34:00

Answer 2
1 (accepted) 2021-06-02 09:31:02
