Pandas: Group by combination of two columns in Pandas 0.23.4

Question

I am fairly new to Python. I came across "Pandas: Group by combination of two columns" on SO. Unfortunately, the accepted answer no longer works with pandas version 0.23.4. The objective of that post is to group by the combination of two columns and create a dictionary of the values in each group, i.e. the group-by should ignore the order of the grouping values.

Here's the accepted answer:

import pandas as pd
from collections import Counter

d = pd.DataFrame([('a','b',1), ('a','c', 2), ('b','a',3), ('b','a',3)],
                 columns=['x', 'y', 'score'])

d[['x', 'y']] = d[['x', 'y']].apply(sorted, axis=1)
x = d.groupby(['x', 'y']).agg(Counter)
print(x)

Here, ...apply(sorted) throws the following exception:

    raise ValueError('Must have equal len keys and value '
ValueError: Must have equal len keys and value when setting with an iterable

Here's my pandas version:

> pd.__version__
Out: '0.23.4'

Here's what I tried after reading https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html :

d = pd.DataFrame([('a','b',1), ('a','c', 2), ('b','a',3), ('b','a',3)],
                 columns=['x', 'y', 'score'])

d=d.sort_values(by=['x','y'],axis=1).reset_index(drop=True)
x = d.groupby(['x', 'y']).agg(Counter)
print(x)

Unfortunately, this also throws an error:

1382, in _get_label_or_level_values
    raise KeyError(key)
KeyError: 'x'

Expected output:

        score           count
x   y                     
a   b   {1: 1, 3: 2}      2
    c   {2: 1}            1

Can someone please help me? On a side note, it would be great if you could also show how to compute the count of keys() in the score column. I am looking for a vectorized solution.

I am using Python 3.6.7.

Many thanks.

Answer 1

The problem is that sorted returns a list for each row, so the result needs to be converted to a Series:

d[['x', 'y']] = d[['x', 'y']].apply(lambda x: pd.Series(sorted(x)), axis=1)

But it is faster to use numpy.sort with the DataFrame constructor, because apply loops under the hood:

import numpy as np

d = pd.DataFrame([('a','b',1), ('a','c', 2), ('b','a',3), ('b','a',3)],
                 columns=['x', 'y', 'score'])

d[['x', 'y']] = pd.DataFrame(np.sort(d[['x', 'y']], axis=1), index=d.index)
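To make the normalization step easier to inspect on its own, here is a minimal sketch that sorts each (x, y) pair with numpy and writes the result into a copy of the frame. The name `normalized` and the decision to work on a copy are illustrative choices, not part of the original answer.

```python
import numpy as np
import pandas as pd

d = pd.DataFrame([('a', 'b', 1), ('a', 'c', 2), ('b', 'a', 3), ('b', 'a', 3)],
                 columns=['x', 'y', 'score'])

# Sort each row's (x, y) pair so that ('b', 'a') and ('a', 'b') become the same key.
keys = np.sort(d[['x', 'y']].values, axis=1)

# Work on a copy (an illustrative choice) so the original frame stays untouched.
normalized = d.copy()
normalized[['x', 'y']] = keys

print(normalized)
```

After this step every row with the same unordered pair carries identical values in x and y, so a plain groupby on those two columns no longer depends on the original order.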

Then select the column to aggregate and pass a list of aggregation functions, e.g. nunique for the number of unique values:

x = d.groupby(['x', 'y'])['score'].agg([Counter, 'nunique'])
print(x)
          Counter  nunique
x y                       
a b  {1: 1, 3: 2}        2
  c        {2: 1}        1

Or count the rows in each group with DataFrameGroupBy.size:

x = d.groupby(['x', 'y'])['score'].agg([Counter, 'size'])
print(x)
          Counter  size
x y                    
a b  {1: 1, 3: 2}     3
  c        {2: 1}     1
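The two counts above answer different questions, which matters for the "count of keys()" asked about in the question. The sketch below, which assumes the (x, y) pairs have already been sorted row-wise, computes both without calling any Python function per group.

```python
import pandas as pd

# The frame after the (x, y) pairs have been sorted row-wise.
d = pd.DataFrame([('a', 'b', 1), ('a', 'c', 2), ('a', 'b', 3), ('a', 'b', 3)],
                 columns=['x', 'y', 'score'])

grouped = d.groupby(['x', 'y'])['score']

# Number of distinct scores per group, i.e. the number of keys in each Counter.
distinct = grouped.nunique()

# Number of rows per group, i.e. the sum of the values in each Counter.
rows = grouped.size()

print(pd.DataFrame({'nunique': distinct, 'size': rows}))
```

The count column in the question's expected output (2 for the a/b group) corresponds to nunique rather than size.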

Answer 2

Use:

a=d[['x','y']].values
a.sort(axis=1)
d[['x','y']] = a
x = d.groupby(['x', 'y']).agg(Counter)
print(x)

Output

            score
x y              
a b  {1: 1, 3: 2}
  c        {2: 1}
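Since agg(Counter) calls a Python function once per group, a possible alternative, if the dictionary form is not strictly required, is the built-in value_counts on the grouped column. This is a sketch under that assumption; the names `tallies` and `as_dicts` are illustrative, and the frame is assumed to have its (x, y) pairs sorted already.

```python
import pandas as pd

# The frame after the (x, y) pairs have been sorted row-wise.
d = pd.DataFrame([('a', 'b', 1), ('a', 'c', 2), ('a', 'b', 3), ('a', 'b', 3)],
                 columns=['x', 'y', 'score'])

# One row per (x, y, score) combination, holding how often it occurs.
tallies = d.groupby(['x', 'y'])['score'].value_counts()
print(tallies)

# Rebuild the nested-dict form from the long form, if it is needed after all.
as_dicts = {}
for (x, y, score), n in tallies.items():
    as_dicts.setdefault((x, y), {})[int(score)] = int(n)

print(as_dicts)
```

The long form keeps the tallies in a regular Series with a three-level index, which is usually easier to filter or join than a column of dictionaries.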

Answer 3

Adding result_type='broadcast' as one of the arguments to .apply() worked.

>>> d = pd.DataFrame([('a','b',1), ('a','c', 2), ('b','a',3), ('b','a',3)],
             columns=['x', 'y', 'score'])
>>> d[['x', 'y']] = d[['x', 'y']].apply(sorted, axis=1, result_type='broadcast')
>>> x = d.groupby(['x', 'y']).agg(Counter)
>>> print(x)

            score
x y              
a b  {1: 1, 3: 2}
  c        {2: 1}

Note the difference with and without result_type='broadcast'.

>>> d[['x', 'y']].apply(sorted, axis=1)

0    [a, b]
1    [a, c]
2    [a, b]
3    [a, b]
dtype: object

>>> d[['x', 'y']].apply(sorted, axis=1, result_type='broadcast')

   x  y
0  a  b
1  a  c
2  a  b
3  a  b

As you can see, result_type='broadcast' splits (broadcasts) the result of .apply() from a list back into the respective columns, allowing the assignment to d[['x', 'y']].
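For comparison, the sketch below puts the default behaviour next to two result_type options of DataFrame.apply. Including 'expand' is an addition for illustration and was not part of the original answer; the variable names are likewise illustrative.

```python
import pandas as pd

d = pd.DataFrame([('a', 'b', 1), ('a', 'c', 2), ('b', 'a', 3), ('b', 'a', 3)],
                 columns=['x', 'y', 'score'])
pairs = d[['x', 'y']]

# Default: each row's sorted list is kept as one object, giving a Series of lists.
as_lists = pairs.apply(sorted, axis=1)

# 'expand': list elements become columns, but the labels are replaced by 0 and 1.
expanded = pairs.apply(sorted, axis=1, result_type='expand')

# 'broadcast': the result keeps the shape, index and column labels of the input.
broadcast = pairs.apply(sorted, axis=1, result_type='broadcast')

print(type(as_lists).__name__)
print(list(expanded.columns))
print(list(broadcast.columns))
```

Because 'broadcast' preserves the x and y labels, its result can be assigned straight back to d[['x', 'y']] without renaming.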
