Pandas: group by and sort on multiple columns

Question

I have some initial data that looks like this:

code          type          value
1111          Golf     Acceptable
1111          Golf    Undesirable
1111    Basketball     Acceptable
1111    Basketball    Undesirable
1111    Basketball    Undesirable

I'm trying to group it on the code and type columns and, within each group, keep the value that occurs most often. In the case of a tie, I want to select the row with the value Undesirable. So the example above would become this:

code          type          value
1111          Golf    Undesirable
1111    Basketball    Undesirable

Currently I'm doing it this way:

df = pd.DataFrame(df.groupby(['code', 'type', 'value']).size().reset_index(name='count'))

df = df.sort_values(['type', 'count'])

df = pd.DataFrame(df.groupby(['code', 'type']).last().reset_index())

I've done some testing of this and it seems to do what I want, but I don't like trusting the .last() call and hoping that, in the case of a tie, Undesirable was sorted last. Is there a better way to group this so that I always get the value with the higher count or, in the case of a tie, the Undesirable value?

Performance isn't much of an issue, as I'm only working with around 50k rows.

Answer 1

Case 1

If the value column contains only two values, i.e. ['Acceptable', 'Undesirable'], then we can rely on the fact that Acceptable < Undesirable alphabetically. In this case you can use the following simplified solution.

Create an auxiliary column called count which contains the number of rows per code, type and value. Then sort the dataframe by count and value, and drop the duplicates per code and type, keeping the last row.

c = ['code', 'type']
df['count'] = df.groupby([*c, 'value'])['value'].transform('count')
df.sort_values(['count', 'value']).drop_duplicates(c, keep='last')
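
For reference, here is a self-contained sketch of Case 1 run against the sample data from the question. The DataFrame construction and the name result are additions for illustration; the three core lines are the ones shown above.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "code": [1111] * 5,
    "type": ["Golf", "Golf", "Basketball", "Basketball", "Basketball"],
    "value": ["Acceptable", "Undesirable", "Acceptable",
              "Undesirable", "Undesirable"],
})

c = ["code", "type"]

# Number of rows sharing the same (code, type, value) combination.
df["count"] = df.groupby([*c, "value"])["value"].transform("count")

# Sort ascending by count, then by value, so that within each (code, type)
# group the last row has the highest count and, on a tie, the
# alphabetically largest value ('Undesirable').
result = df.sort_values(["count", "value"]).drop_duplicates(c, keep="last")
print(result)
```

The tie-break here is carried entirely by the second sort key, so it only works while the preferred value sorts last alphabetically.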

Case 2

If the value column contains other values and you can't rely on alphabetical ordering, use the following solution. It is similar to the one proposed in Case 1, but it first converts the value column to an ordered Categorical type before sorting.

c = ['code', 'type']

df['count'] = df.groupby([*c, 'value'])['value'].transform('count')
df['value'] = pd.Categorical(df['value'], categories=['Acceptable', 'Undesirable'], ordered=True)

df.sort_values(['count', 'value']).drop_duplicates(c, keep='last')

Result

   code        type        value  count
1  1111        Golf  Undesirable      1
4  1111  Basketball  Undesirable      2
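
To see why Case 2 matters, here is a small sketch with hypothetical labels Good and Bad (not from the question), where the value that should win a tie sorts first alphabetically. The data and the names raw, plain, ordered and result are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical data: every (code, type, value) combination occurs once,
# so each group is a tie, and 'Bad' should win the tie.
raw = pd.DataFrame({
    "code": [1111] * 4,
    "type": ["Golf", "Golf", "Tennis", "Tennis"],
    "value": ["Good", "Bad", "Bad", "Good"],
})

c = ["code", "type"]
raw["count"] = raw.groupby([*c, "value"])["value"].transform("count")

# Plain string sort: 'Bad' < 'Good', so 'Good' ends up last and is kept.
plain = raw.sort_values(["count", "value"]).drop_duplicates(c, keep="last")

# Ordered categorical: the category listed last ranks highest,
# so 'Bad' ends up last and is kept.
ordered = raw.copy()
ordered["value"] = pd.Categorical(
    ordered["value"], categories=["Good", "Bad"], ordered=True
)
result = ordered.sort_values(["count", "value"]).drop_duplicates(c, keep="last")

print(plain["value"].tolist())
print(result["value"].astype(str).tolist())
```

The order of the categories list is what decides the tie, so the preferred value goes last in that list.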

Answer 2

Another possible solution, which is based on the following ideas:

Grouping the data by code and type.
If a group has more than one row (len(x) > 1) and all of its rows have the same count ((x['count'] == x['count'].min()).all()), return the row with Undesirable.
Otherwise, return the row where the count is maximum (x.iloc[[x['count'].argmax()]]).

(df.groupby(['code', 'type', 'value'])['value'].size()
 .reset_index(name='count').groupby(['code', 'type'])
 .apply(lambda x: x.loc[x['value'] == 'Undesirable'] if 
        ((len(x) > 1) and (x['count'] == x['count'].min()).all()) else
        x.iloc[[x['count'].argmax()]])
 .reset_index(drop=True)
 .drop('count', axis=1))

Output:

   code        type        value
0  1111  Basketball  Undesirable
1  1111        Golf  Undesirable
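
The same rule can also be sketched without groupby.apply, by making the tie-break an explicit sort key. This is a different technique from the one in this answer, not the author's code; the helper column is_undesirable and the names counts and best are assumptions introduced for illustration.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "code": [1111] * 5,
    "type": ["Golf", "Golf", "Basketball", "Basketball", "Basketball"],
    "value": ["Acceptable", "Undesirable", "Acceptable",
              "Undesirable", "Undesirable"],
})

# One row per (code, type, value) with its number of occurrences.
counts = df.groupby(["code", "type", "value"]).size().reset_index(name="count")

# Explicit tie-break flag: True for the value that should win a tie.
counts["is_undesirable"] = counts["value"].eq("Undesirable")

best = (
    counts
    # Highest count first; among equal counts, 'Undesirable' first.
    .sort_values(["count", "is_undesirable"], ascending=False)
    # Keep the first (best) row of each (code, type) group.
    .drop_duplicates(["code", "type"])
    .drop(columns=["count", "is_undesirable"])
    .sort_values(["code", "type"])
    .reset_index(drop=True)
)
print(best)
```

One difference worth noting: the apply-based version treats a group as tied only when every value in it has the same count, and it returns no row for a tied group that contains no Undesirable value. The sort-based sketch compares only the rows with the highest count and always keeps one row per group.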

Answer 1 | 1 | 2022-09-26 16:49:35
Answer 2 | 0 | 2022-09-26 17:45:17
