
Pandas: Group by combination of two columns in Pandas 0.23.4

I am fairly new to Python. I came across "Pandas: Group by combination of two columns" on SO. Unfortunately, the accepted answer no longer works with pandas version 0.23.4. The goal of that post is to group by the combination of two columns and to build a dictionary of values for each group, i.e. the group-by should ignore the order of the grouping columns.

Here's the accepted answer:

import pandas as pd
from collections import Counter

d = pd.DataFrame([('a','b',1), ('a','c', 2), ('b','a',3), ('b','a',3)],
                 columns=['x', 'y', 'score'])

d[['x', 'y']] = d[['x', 'y']].apply(sorted, axis=1)
x = d.groupby(['x', 'y']).agg(Counter)
print(x)

Here, ...apply(sorted) throws the following exception:

    raise ValueError('Must have equal len keys and value '
ValueError: Must have equal len keys and value when setting with an iterable

Here's my pandas version:

> pd.__version__
Out: '0.23.4'

Here's what I tried after reading https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html :

d = pd.DataFrame([('a','b',1), ('a','c', 2), ('b','a',3), ('b','a',3)],
                 columns=['x', 'y', 'score'])

d=d.sort_values(by=['x','y'],axis=1).reset_index(drop=True)
x = d.groupby(['x', 'y']).agg(Counter)
print(x)

Unfortunately, this also throws an error:

1382, in _get_label_or_level_values
    raise KeyError(key)
KeyError: 'x'
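
A likely reason for this KeyError, going by the sort_values documentation: with axis=1 the call reorders columns, and by must then name row (index) labels rather than column names. The frame has no row labelled 'x', so the lookup fails. Sorting with the default axis=0 does run, but it only reorders whole rows and leaves the ('b', 'a') pairs unswapped. A minimal sketch of both cases (the names raised, sorted_rows and pairs are illustrative):

```python
import pandas as pd

d = pd.DataFrame([('a', 'b', 1), ('a', 'c', 2), ('b', 'a', 3), ('b', 'a', 3)],
                 columns=['x', 'y', 'score'])

# With axis=1, `by` is looked up among the row labels (0, 1, 2, 3),
# so the column names 'x' and 'y' are not found.
try:
    d.sort_values(by=['x', 'y'], axis=1)
    raised = False
except KeyError:
    raised = True
print(raised)

# The default axis=0 sorts whole rows by the values in x and y.
# It does not swap values within a row, so ('b', 'a') is still present.
sorted_rows = d.sort_values(by=['x', 'y']).reset_index(drop=True)
pairs = set(zip(sorted_rows['x'], sorted_rows['y']))
print(pairs)
```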

Expected output:

        score           count
x   y                     
a   b   {1: 1, 3: 2}      2
    c   {2: 1}            1 

Can someone please help me? On a side note, it would be great if you could also show how to compute the count of keys() in the score column. I am looking for a vectorized solution.

I am using Python 3.6.7.

Many thanks.

The problem is that sorted returns lists, so it is necessary to convert them to a Series:

d[['x', 'y']] = d[['x', 'y']].apply(lambda x: pd.Series(sorted(x)), axis=1)

But it is faster to use numpy.sort with the DataFrame constructor, because apply loops under the hood:

import numpy as np

d = pd.DataFrame([('a','b',1), ('a','c', 2), ('b','a',3), ('b','a',3)],
                 columns=['x', 'y', 'score'])

d[['x', 'y']] = pd.DataFrame(np.sort(d[['x', 'y']], axis=1), index=d.index)
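
As a quick check of this idea, the sketch below sorts each (x, y) pair with numpy.sort and confirms that ('a', 'b') and ('b', 'a') end up in the same group. It assigns the sorted array directly instead of wrapping it in a DataFrame, and it counts rows with size(). The names pairs and sizes are illustrative only.

```python
import numpy as np
import pandas as pd

d = pd.DataFrame([('a', 'b', 1), ('a', 'c', 2), ('b', 'a', 3), ('b', 'a', 3)],
                 columns=['x', 'y', 'score'])

# Sort the two key columns within each row, so ('b', 'a') becomes ('a', 'b').
pairs = np.sort(d[['x', 'y']].values, axis=1)
d[['x', 'y']] = pairs

# Rows that differed only in key order now share one group.
sizes = d.groupby(['x', 'y']).size()
print(sizes)
```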

Then select the column for aggregation and pass a list of aggregation functions, e.g. nunique to count the number of unique values:

x = d.groupby(['x', 'y'])['score'].agg([Counter, 'nunique'])
print(x)
          Counter  nunique
x y                       
a b  {1: 1, 3: 2}        2
  c        {2: 1}        1

Or count with GroupBy.size, which returns the number of rows in each group (duplicates included):

x = d.groupby(['x', 'y'])['score'].agg([Counter, 'size'])
print(x)
          Counter  size
x y                    
a b  {1: 1, 3: 2}     3
  c        {2: 1}     1
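
To get closer to the layout asked for in the question (a score column holding the Counter and a count column holding its number of keys), one option is to keep the aggregated Series under its original name and take len of each Counter. This is a sketch rather than part of the answer above. It assumes the key columns are sorted with numpy.sort first, and the name out is illustrative. Note that map(len) is a Python-level loop; 'nunique' is the built-in aggregation that gives the same number.

```python
from collections import Counter

import numpy as np
import pandas as pd

d = pd.DataFrame([('a', 'b', 1), ('a', 'c', 2), ('b', 'a', 3), ('b', 'a', 3)],
                 columns=['x', 'y', 'score'])
d[['x', 'y']] = np.sort(d[['x', 'y']].values, axis=1)

# Aggregate the scores of each group into a Counter; the resulting
# Series keeps the name 'score', so to_frame() yields a 'score' column.
out = d.groupby(['x', 'y'])['score'].agg(Counter).to_frame()

# Number of distinct keys in each Counter.
out['count'] = out['score'].map(len)
print(out)
```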

Use:

a=d[['x','y']].values
a.sort(axis=1)
d[['x','y']] = a
x = d.groupby(['x', 'y']).agg(Counter)
print(x)

Output:

            score
x y              
a b  {1: 1, 3: 2}
  c        {2: 1}

Adding result_type='broadcast' as one of the arguments to .apply() worked.

>>> d = pd.DataFrame([('a','b',1), ('a','c', 2), ('b','a',3), ('b','a',3)],
             columns=['x', 'y', 'score'])
>>> d[['x', 'y']] = d[['x', 'y']].apply(sorted, axis=1, result_type='broadcast')
>>> x = d.groupby(['x', 'y']).agg(Counter)
>>> print(x)

            score
x y              
a b  {1: 1, 3: 2}
  c        {2: 1}

Note the difference with and without result_type='broadcast'.

>>> d[['x', 'y']].apply(sorted, axis=1)

0    [a, b]
1    [a, c]
2    [a, b]
3    [a, b]
dtype: object

>>> d[['x', 'y']].apply(sorted, axis=1, result_type='broadcast')

   x  y
0  a  b
1  a  c
2  a  b
3  a  b

As you can see, result_type='broadcast' splits (broadcasts) the result of .apply() from a list back into the respective columns, allowing the assignment to d[['x', 'y']].
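
For comparison, apply also accepts result_type='expand', which turns each returned list into columns labelled 0 and 1 instead of keeping the original labels. The sketch below contrasts the two modes. It assumes pandas 0.23 or later, where result_type is available, and the variable names are illustrative.

```python
import pandas as pd

d = pd.DataFrame([('a', 'b', 1), ('a', 'c', 2), ('b', 'a', 3), ('b', 'a', 3)],
                 columns=['x', 'y', 'score'])
keys = d[['x', 'y']]

# 'expand' builds new columns from each returned list (labels 0 and 1).
expanded = keys.apply(sorted, axis=1, result_type='expand')

# 'broadcast' keeps the original shape, index and column labels.
broadcast = keys.apply(sorted, axis=1, result_type='broadcast')

print(list(expanded.columns))
print(list(broadcast.columns))
```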
