Pandas Group By and Sorting by multiple columns
I have some initial data that looks like this:
code type value
1111 Golf Acceptable
1111 Golf Undesirable
1111 Basketball Acceptable
1111 Basketball Undesirable
1111 Basketball Undesirable
and I'm trying to group it on the code and type columns to get the row with the most occurrences. In the case of a tie, I want to select the row with the value Undesirable. So the example above would become this:
code type value
1111 Golf Undesirable
1111 Basketball Undesirable
Currently I'm doing it this way:
df = pd.DataFrame(df.groupby(['code', 'type', 'value']).size().reset_index(name='count'))
df = df.sort_values(['type', 'count'])
df = pd.DataFrame(df.groupby(['code', 'type']).last().reset_index())
I've done some testing of this and it seems to do what I want, but I don't really like trusting the .last() call and hoping that, in the case of a tie, Undesirable was sorted last. Is there a better way to group this to ensure I always get the higher count, or, in the case of a tie, select the Undesirable value?
Performance isn't too much of an issue as I'm only working with around 50k rows or so.
If the value column only contains two values, i.e. ['Acceptable', 'Undesirable'], then we can rely on the fact that Acceptable < Undesirable alphabetically. In this case you can use the following simplified solution.
Create an auxiliary column called count which contains the number of rows per code, type and value. Then sort the dataframe by count and value and drop the duplicates per code and type, keeping the last row.
c = ['code', 'type']
df['count'] = df.groupby([*c, 'value'])['value'].transform('count')
df.sort_values(['count', 'value']).drop_duplicates(c, keep='last')
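To make the steps above easy to try, here is a minimal self-contained sketch that rebuilds the sample data from the question and applies the same sort-then-drop-duplicates idea. The names counted and result are illustrative, and assign is used only so the original frame is left unmodified.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'code': [1111] * 5,
    'type': ['Golf', 'Golf', 'Basketball', 'Basketball', 'Basketball'],
    'value': ['Acceptable', 'Undesirable', 'Acceptable',
              'Undesirable', 'Undesirable'],
})

c = ['code', 'type']

# Number of rows per (code, type, value), broadcast back onto every row
counted = df.assign(count=df.groupby([*c, 'value'])['value'].transform('count'))

# Sort ascending by count, then by value (Acceptable < Undesirable),
# so the last row of each (code, type) group is the one to keep
result = counted.sort_values(['count', 'value']).drop_duplicates(c, keep='last')
print(result)
```

Because value is an explicit sort key, the tie-break no longer depends on whatever order the rows happened to arrive in.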
If the value column contains other values and you can't rely on alphabetical ordering, use the following solution. It is similar to the one proposed in case 1, but it first converts the value column to an ordered Categorical type before sorting.
c = ['code', 'type']
df['count'] = df.groupby([*c, 'value'])['value'].transform('count')
df['value'] = pd.Categorical(df['value'], categories=['Acceptable', 'Undesirable'], ordered=True)
df.sort_values(['count', 'value']).drop_duplicates(c, keep='last')
code type value count
1 1111 Golf Undesirable 1
4 1111 Basketball Undesirable 2
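The difference between the two cases only shows up when a value sorts after Undesirable alphabetically. The sketch below assumes a hypothetical third value, Unknown, which is not in the question's data and which should lose ties to Undesirable; the names order, plain and ranked are likewise illustrative.

```python
import pandas as pd

# Hypothetical data: 'Unknown' is an assumed extra value, not from the question
df = pd.DataFrame({
    'code': [1111] * 5,
    'type': ['Golf', 'Golf', 'Basketball', 'Basketball', 'Basketball'],
    'value': ['Unknown', 'Undesirable', 'Acceptable',
              'Undesirable', 'Undesirable'],
})

c = ['code', 'type']
order = ['Unknown', 'Acceptable', 'Undesirable']  # lowest to highest priority

df['count'] = df.groupby([*c, 'value'])['value'].transform('count')

# Plain string sort: 'Unknown' sorts after 'Undesirable', so it wins the Golf tie
plain = df.sort_values(['count', 'value']).drop_duplicates(c, keep='last')

# Ordered Categorical: the sort follows `order`, so 'Undesirable' wins the tie
ranked = df.assign(value=pd.Categorical(df['value'], categories=order, ordered=True))
ranked = ranked.sort_values(['count', 'value']).drop_duplicates(c, keep='last')

print(plain)
print(ranked)
```

Note that any value missing from the categories list becomes NaN in the Categorical column, so the list should cover every value that can occur.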
Another possible solution, which is based on the following ideas:
Group the data by code and type.
If a group has more than one row (len(x) > 1) and all of its rows have the same count ((x['count'] == x['count'].min()).all()), return the row with Undesirable.
Otherwise, return the row where the count is maximum (x.iloc[[x['count'].argmax()]]).
(df.groupby(['code', 'type', 'value'])['value'].size()
.reset_index(name='count').groupby(['code', 'type'])
.apply(lambda x: x.loc[x['value'] == 'Undesirable'] if
((len(x) > 1) and (x['count'] == x['count'].min()).all()) else
x.iloc[[x['count'].argmax()]])
.reset_index(drop=True)
.drop('count', axis=1))
Output:
code type value
0 1111 Basketball Undesirable
1 1111 Golf Undesirable
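A related sketch, not the answer's own code: the same "highest count, Undesirable on ties" rule can be written without groupby(...).apply by sorting on an explicit boolean flag. This may be worth considering because recent pandas versions warn about, or change, how apply treats the grouping columns. The is_undesirable column and the counts and result names are assumptions introduced for illustration. Unlike the apply version, this one breaks ties among the top counts only, rather than requiring every count in the group to be equal.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'code': [1111] * 5,
    'type': ['Golf', 'Golf', 'Basketball', 'Basketball', 'Basketball'],
    'value': ['Acceptable', 'Undesirable', 'Acceptable',
              'Undesirable', 'Undesirable'],
})

# One row per (code, type, value) with its number of occurrences
counts = df.groupby(['code', 'type', 'value']).size().reset_index(name='count')

# Explicit tie-break key: False sorts before True, so 'Undesirable' sorts last
counts['is_undesirable'] = counts['value'].eq('Undesirable')

result = (counts.sort_values(['count', 'is_undesirable'])
                .drop_duplicates(['code', 'type'], keep='last')
                .drop(columns=['count', 'is_undesirable'])
                .reset_index(drop=True))
print(result)
```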