
Pandas: Group by combination of two columns in Pandas 0.23.4

I am fairly new to Python. I came across "Pandas: Group by combination of two columns" on SO. Unfortunately, the accepted answer no longer works with pandas version 0.23.4. The objective of that post is to group by the combination of two columns and to build a dictionary of value counts for each group, i.e. the group-by should ignore the order of the two grouping columns.

Here's the accepted answer:

import pandas as pd
from collections import Counter

d = pd.DataFrame([('a','b',1), ('a','c', 2), ('b','a',3), ('b','a',3)],
                 columns=['x', 'y', 'score'])

d[['x', 'y']] = d[['x', 'y']].apply(sorted, axis=1)
x = d.groupby(['x', 'y']).agg(Counter)
print(x)

Here, ...apply(sorted) throws the following exception:

    raise ValueError('Must have equal len keys and value '
ValueError: Must have equal len keys and value when setting with an iterable

Here's my pandas version:

> pd.__version__
Out: '0.23.4'

Here's what I tried after reading https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html :

d = pd.DataFrame([('a','b',1), ('a','c', 2), ('b','a',3), ('b','a',3)],
                 columns=['x', 'y', 'score'])

d=d.sort_values(by=['x','y'],axis=1).reset_index(drop=True)
x = d.groupby(['x', 'y']).agg(Counter)
print(x)

Unfortunately, this also throws error:

1382, in _get_label_or_level_values
    raise KeyError(key)
KeyError: 'x'

Expected output:

        score           count
x   y                     
a   b   {1: 1, 3: 2}      2
    c   {2: 1}            1 

Can someone please help me? On a side note, it would be great if you could also show how to compute the number of keys in each dictionary of the score column. I am looking for a vectorized solution.

I am using python 3.6.7

Many thanks.

The problem is that sorted returns a list, so it is necessary to convert it to a Series:

d[['x', 'y']] = d[['x', 'y']].apply(lambda x: pd.Series(sorted(x)), axis=1)

But it is faster to use numpy.sort with the DataFrame constructor, because apply loops over the rows under the hood:

import numpy as np

d = pd.DataFrame([('a','b',1), ('a','c', 2), ('b','a',3), ('b','a',3)],
                 columns=['x', 'y', 'score'])

# sort each (x, y) pair within its row, keeping the original index
d[['x', 'y']] = pd.DataFrame(np.sort(d[['x', 'y']], axis=1), index=d.index)
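To make the effect of the row-wise sort easier to see, here is a minimal sketch on the same sample data. It assigns the two sorted columns one at a time, which is only an illustrative variation on the line above and not a different technique; the name pairs is made up for this example.

```python
import numpy as np
import pandas as pd

d = pd.DataFrame([('a', 'b', 1), ('a', 'c', 2), ('b', 'a', 3), ('b', 'a', 3)],
                 columns=['x', 'y', 'score'])

# Sort each (x, y) pair within its row, so ('b', 'a') becomes ('a', 'b').
# np.sort returns a new array; the DataFrame itself is not modified here.
pairs = np.sort(d[['x', 'y']].values, axis=1)

# Write the sorted pairs back one column at a time (illustrative choice).
d['x'] = pairs[:, 0]
d['y'] = pairs[:, 1]

print(d)
```

After this step every row holds its pair in the same order, so rows that started as ('a', 'b') and ('b', 'a') end up in the same group.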

Then select the column to aggregate and pass a list of aggregation functions, e.g. nunique for the number of unique values:

x = d.groupby(['x', 'y'])['score'].agg([Counter, 'nunique'])
print(x)
          Counter  nunique
x y                       
a b  {1: 1, 3: 2}        2
  c        {2: 1}        1

Or count the rows in each group with DataFrameGroupBy.size (this counts duplicates too, so group (a, b) gives 3 rather than 2):

x = d.groupby(['x', 'y'])['score'].agg([Counter, 'size'])
print(x)
          Counter  size
x y                    
a b  {1: 1, 3: 2}     3
  c        {2: 1}     1
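The difference between the two counts can be checked side by side. The sketch below assumes the x and y columns have already been sorted row-wise, and it builds the per-group counters with a plain dictionary comprehension purely for illustration; the names grouped, summary and counters are made up for this example.

```python
from collections import Counter

import pandas as pd

# Assumption: the (x, y) pairs are already sorted within each row.
d = pd.DataFrame([('a', 'b', 1), ('a', 'c', 2), ('a', 'b', 3), ('a', 'b', 3)],
                 columns=['x', 'y', 'score'])

grouped = d.groupby(['x', 'y'])['score']

# 'size' counts all rows in a group; 'nunique' counts distinct score values.
summary = grouped.agg(['size', 'nunique'])

# One Counter per group key, built by iterating over the groups.
counters = {key: Counter(scores) for key, scores in grouped}

print(summary)
print(counters)
```

For the group (a, b) the scores are 1, 3 and 3, so size reports three rows while nunique reports two distinct values, which matches the number of keys in that group's counter.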

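Use NumPy's in-place sort on the underlying values:

a = d[['x','y']].values.copy()  # copy, so the in-place sort works on a writable array
a.sort(axis=1)                  # sort each row in place
d[['x','y']] = a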
x = d.groupby(['x', 'y']).agg(Counter)
print(x)

Output

            score
x y              
a b  {1: 1, 3: 2}
  c        {2: 1}

Adding result_type='broadcast' as one of the arguments to .apply() worked.

>>> d = pd.DataFrame([('a','b',1), ('a','c', 2), ('b','a',3), ('b','a',3)],
...                  columns=['x', 'y', 'score'])
>>> d[['x', 'y']] = d[['x', 'y']].apply(sorted, axis=1, result_type='broadcast')
>>> x = d.groupby(['x', 'y']).agg(Counter)
>>> print(x)

            score
x y              
a b  {1: 1, 3: 2}
  c        {2: 1}

Note the difference with and without result_type='broadcast'.

>>> d[['x', 'y']].apply(sorted, axis=1)

0    [a, b]
1    [a, c]
2    [a, b]
3    [a, b]
dtype: object

>>> d[['x', 'y']].apply(sorted, axis=1, result_type='broadcast')

   x  y
0  a  b
1  a  c
2  a  b
3  a  b

As you can see, result_type='broadcast' splits (broadcasts) the result of .apply() from a list back into the respective columns, which allows the assignment to d[['x', 'y']].
