I am fairly new to Python. I came across "Pandas: Group by combination of two columns" on SO. Unfortunately, the accepted answer no longer works with pandas version 0.23.4.
The objective of that post is to group by the combination of two columns regardless of their order, i.e. the group-by should treat ('a', 'b') and ('b', 'a') as the same group, and to build a dictionary of value counts for each group.
Here's the accepted answer:
import pandas as pd
from collections import Counter
d = pd.DataFrame([('a','b',1), ('a','c', 2), ('b','a',3), ('b','a',3)],
columns=['x', 'y', 'score'])
d[['x', 'y']] = d[['x', 'y']].apply(sorted, axis=1)
x = d.groupby(['x', 'y']).agg(Counter)
print(x)
Here, ...apply(sorted) throws the following exception:
    raise ValueError('Must have equal len keys and value '
ValueError: Must have equal len keys and value when setting with an iterable
Here's my pandas version:
> pd.__version__
Out: '0.23.4'
Here's what I tried after reading https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html :
d = pd.DataFrame([('a','b',1), ('a','c', 2), ('b','a',3), ('b','a',3)],
columns=['x', 'y', 'score'])
d=d.sort_values(by=['x','y'],axis=1).reset_index(drop=True)
x = d.groupby(['x', 'y']).agg(Counter)
print(x)
Unfortunately, this also throws an error:
1382, in _get_label_or_level_values
    raise KeyError(key)
KeyError: 'x'
Expected output:
            score  count
x y
a b  {1: 1, 3: 2}      2
  c        {2: 1}      1
Can someone please help me? On a side note, it would be great if you could also show how to compute the number of keys() in each dictionary of the score column (the count column above). I am looking for a vectorized solution.
I am using Python 3.6.7.
Many thanks.
The problem is that sorted returns a list, so each row's result has to be converted to a Series:
d[['x', 'y']] = d[['x', 'y']].apply(lambda x: pd.Series(sorted(x)), axis=1)
It is faster, though, to use numpy.sort with the DataFrame constructor, because apply loops over the rows under the hood:
import numpy as np

d = pd.DataFrame([('a','b',1), ('a','c', 2), ('b','a',3), ('b','a',3)],
                 columns=['x', 'y', 'score'])
d[['x', 'y']] = pd.DataFrame(np.sort(d[['x', 'y']], axis=1), index=d.index)
Then select the column to aggregate and pass a list of aggregation functions, e.g. nunique for the number of unique values:
x = d.groupby(['x', 'y'])['score'].agg([Counter, 'nunique'])
print(x)
          Counter  nunique
x y
a b  {1: 1, 3: 2}        2
  c        {2: 1}        1
Or count the rows in each group with DataFrameGroupBy.size:
x = d.groupby(['x', 'y'])['score'].agg([Counter, 'size'])
print(x)
          Counter  size
x y
a b  {1: 1, 3: 2}     3
  c        {2: 1}     1
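Since the question asks for a vectorized way to count the keys, here is a minimal sketch of an alternative that avoids building Counter objects at all. It is not part of the answer above; it assumes the same sample frame and uses SeriesGroupBy.value_counts, which returns one row per (x, y, score) combination, carrying the same information as the per-group dictionaries. The names counts and n_keys are made up for illustration.

```python
import numpy as np
import pandas as pd

d = pd.DataFrame(
    [("a", "b", 1), ("a", "c", 2), ("b", "a", 3), ("b", "a", 3)],
    columns=["x", "y", "score"],
)

# Normalise each pair so that ('b', 'a') and ('a', 'b') land in the same group.
d[["x", "y"]] = np.sort(d[["x", "y"]].values, axis=1)

grouped = d.groupby(["x", "y"])["score"]

# One row per (x, y, score) combination; the value is how often it occurs.
counts = grouped.value_counts()

# Number of distinct scores per pair, i.e. the number of keys each
# per-group dictionary would have.
n_keys = grouped.nunique()

print(counts)
print(n_keys)
```

If the dictionaries themselves are not needed downstream, the long-format counts Series is usually easier to filter and join than a column of dict objects.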
Use -
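# d[['x', 'y']] is a copy, so sorting its values in place does not touch d
a = d[['x', 'y']].values
a.sort(axis=1)       # sort within each row: ('b', 'a') becomes ('a', 'b')
d[['x', 'y']] = a    # write the normalised pairs back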
x = d.groupby(['x', 'y']).agg(Counter)
print(x)
Output
            score
x y
a b  {1: 1, 3: 2}
  c        {2: 1}
Adding result_type='broadcast' as one of the arguments to .apply() worked.
>>> d = pd.DataFrame([('a','b',1), ('a','c', 2), ('b','a',3), ('b','a',3)],
...                  columns=['x', 'y', 'score'])
>>> d[['x', 'y']] = d[['x', 'y']].apply(sorted, axis=1, result_type='broadcast')
>>> x = d.groupby(['x', 'y']).agg(Counter)
>>> print(x)
            score
x y
a b  {1: 1, 3: 2}
  c        {2: 1}
Note the difference with and without result_type='broadcast'.
>>> d[['x', 'y']].apply(sorted, axis=1)
0    [a, b]
1    [a, c]
2    [a, b]
3    [a, b]
dtype: object
>>> d[['x', 'y']].apply(sorted, axis=1, result_type='broadcast')
   x  y
0  a  b
1  a  c
2  a  b
3  a  b
As you can see, result_type='broadcast' splits (broadcasts) the result of .apply() back from a list into the respective columns, which allows the assignment to d[['x', 'y']].
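For comparison, a minimal sketch of how result_type='broadcast' differs from result_type='expand', the other option of DataFrame.apply that spreads list-like results over columns. The two-row frame is made up for illustration. As far as I can tell from the documentation, 'expand' produces positional column labels, while 'broadcast' keeps the original index and columns.

```python
import pandas as pd

# Hypothetical two-row frame, just large enough to show the difference.
d = pd.DataFrame([("b", "a"), ("a", "c")], columns=["x", "y"])

# 'expand' turns each returned list into columns labelled by position.
expanded = d.apply(sorted, axis=1, result_type="expand")

# 'broadcast' keeps the original shape, index and column labels.
broadcast = d.apply(sorted, axis=1, result_type="broadcast")

print(list(expanded.columns))
print(list(broadcast.columns))
```

Either result holds the sorted pairs; 'broadcast' is simply the more convenient one here because its columns already line up with d[['x', 'y']].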