Pandas: Group by combination of two columns in Pandas 0.23.4
I am fairly new to Python. I came across "Pandas: Group by combination of two columns" on SO. Unfortunately, the accepted answer no longer works with pandas version 0.23.4.
The objective of that post is to find the combinations of the group variables and create a dictionary of values for each one, i.e. group_by should ignore the order of the grouping columns.
Here's the accepted answer:
import pandas as pd
from collections import Counter
d = pd.DataFrame([('a','b',1), ('a','c', 2), ('b','a',3), ('b','a',3)],
columns=['x', 'y', 'score'])
d[['x', 'y']] = d[['x', 'y']].apply(sorted, axis=1)
x = d.groupby(['x', 'y']).agg(Counter)
print(x)
Here, ...apply(sorted) throws the following exception:
raise ValueError('Must have equal len keys and value '
ValueError: Must have equal len keys and value when setting with an iterable
Here's my pandas version:
> pd.__version__
Out: '0.23.4'
Here's what I tried after reading https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html :
d = pd.DataFrame([('a','b',1), ('a','c', 2), ('b','a',3), ('b','a',3)],
columns=['x', 'y', 'score'])
d=d.sort_values(by=['x','y'],axis=1).reset_index(drop=True)
x = d.groupby(['x', 'y']).agg(Counter)
print(x)
Unfortunately, this also throws an error:
1382, in _get_label_or_level_values
raise KeyError(key)
KeyError: 'x'
Expected output:
            score  count
x y
a b  {1: 1, 3: 2}      2
  c        {2: 1}      1
Can someone please help me? On a side note, it would be great if you could also show how to compute the count of keys() in the score column. I am looking for a vectorized solution.
I am using Python 3.6.7.
Many thanks.
The problem is that sorted returns lists, so it is necessary to convert each result to a Series:
d[['x', 'y']] = d[['x', 'y']].apply(lambda x: pd.Series(sorted(x)), axis=1)
But it is faster to use numpy.sort with the DataFrame constructor, because apply loops under the hood:
import numpy as np
d = pd.DataFrame([('a','b',1), ('a','c', 2), ('b','a',3), ('b','a',3)],
                 columns=['x', 'y', 'score'])
# sort each (x, y) pair within its row so that ('b', 'a') becomes ('a', 'b')
d[['x', 'y']] = pd.DataFrame(np.sort(d[['x', 'y']], axis=1), index=d.index)
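As a rough check that the two approaches above agree, the following sketch builds the sorted pairs both ways and compares them. The names via_apply, via_numpy and same are illustrative only and do not come from the original answer.

```python
import numpy as np
import pandas as pd

d = pd.DataFrame(
    [('a', 'b', 1), ('a', 'c', 2), ('b', 'a', 3), ('b', 'a', 3)],
    columns=['x', 'y', 'score'],
)

# Row-wise apply: wrapping the sorted list in a Series makes apply return
# a DataFrame (one column per element) instead of a Series of lists.
via_apply = d[['x', 'y']].apply(lambda row: pd.Series(sorted(row)), axis=1)
via_apply.columns = ['x', 'y']  # the columns come back labelled 0 and 1

# NumPy: sort every row of the underlying 2-D array in a single call.
via_numpy = pd.DataFrame(
    np.sort(d[['x', 'y']].values, axis=1),
    index=d.index,
    columns=['x', 'y'],
)

# Compare the raw values of both results.
same = via_apply.values.tolist() == via_numpy.values.tolist()
print(same)
```

Either result can then be assigned back to d[['x', 'y']] before grouping.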
Then select the column to aggregate and pass a list of aggregation functions, e.g. nunique to count the number of unique values:
x = d.groupby(['x', 'y'])['score'].agg([Counter, 'nunique'])
print(x)
          Counter  nunique
x y
a b  {1: 1, 3: 2}        2
  c        {2: 1}        1
Or count the rows in each group with DataFrameGroupBy.size:
x = d.groupby(['x', 'y'])['score'].agg([Counter, 'size'])
print(x)
          Counter  size
x y
a b  {1: 1, 3: 2}     3
  c        {2: 1}     1
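To get the column names from the expected output in the question (score and count), one option is to rename the aggregated columns afterwards. This is only a sketch: the rename step and the name result are additions, not part of the original answer.

```python
from collections import Counter

import numpy as np
import pandas as pd

d = pd.DataFrame(
    [('a', 'b', 1), ('a', 'c', 2), ('b', 'a', 3), ('b', 'a', 3)],
    columns=['x', 'y', 'score'],
)

# Make the grouping order-independent by sorting each (x, y) pair.
d[['x', 'y']] = np.sort(d[['x', 'y']].values, axis=1)

result = (
    d.groupby(['x', 'y'])['score']
     .agg([Counter, 'nunique'])  # columns are named 'Counter' and 'nunique'
     .rename(columns={'Counter': 'score', 'nunique': 'count'})
)
print(result)
```

Here nunique counts the distinct score values in each group, which is the same as the number of keys in that group's Counter.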
Use:
a=d[['x','y']].values
a.sort(axis=1)
d[['x','y']] = a
x = d.groupby(['x', 'y']).agg(Counter)
print(x)
Output:
            score
x y
a b  {1: 1, 3: 2}
  c        {2: 1}
Adding result_type='broadcast' as one of the arguments to .apply() worked.
>>> d = pd.DataFrame([('a','b',1), ('a','c', 2), ('b','a',3), ('b','a',3)],
...                  columns=['x', 'y', 'score'])
>>> d[['x', 'y']] = d[['x', 'y']].apply(sorted, axis=1, result_type='broadcast')
>>> x = d.groupby(['x', 'y']).agg(Counter)
>>> print(x)
            score
x y
a b  {1: 1, 3: 2}
  c        {2: 1}
Note the difference with and without result_type='broadcast'.
>>> d[['x', 'y']].apply(sorted, axis=1)
0    [a, b]
1    [a, c]
2    [a, b]
3    [a, b]
dtype: object
>>> d[['x', 'y']].apply(sorted, axis=1, result_type='broadcast')
   x  y
0  a  b
1  a  c
2  a  b
3  a  b
As you can see, result_type='broadcast' splits (broadcasts) the result of .apply() back from a list into the respective columns, allowing the assignment to d[['x', 'y']].