
A GroupBy with combinations of the categorical variables

Let's say I have data:

import pandas as pd

df = pd.DataFrame({'index': ['a','b','c','a','b','c'], 'column': [1,2,3,4,1,2]}).set_index(['index'])

which gives:

       column
index
a           1
b           2
c           3
a           4
b           1
c           2

Then, to get the mean of each subgroup, one would run:

df.groupby(df.index).mean()

       column
index
a         2.5
b         1.5
c         2.5

However, what I've been trying to work out, without constantly looping and slicing the data, is how to get the mean for pairs of subgroups.

For instance, the mean of a & b is 2, as if their values were combined.
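To make that arithmetic concrete, here is a minimal check (a sketch added for illustration; the names `pooled` and `pair_mean` are hypothetical) that pools the rows labelled a and b and averages them as a single group:

```python
import pandas as pd

df = pd.DataFrame(
    {'index': ['a', 'b', 'c', 'a', 'b', 'c'], 'column': [1, 2, 3, 4, 1, 2]}
).set_index('index')

# Select every row labelled 'a' or 'b' and average them as one group
pooled = df.loc[['a', 'b'], 'column']
pair_mean = pooled.mean()  # (1 + 4 + 2 + 1) / 4
print(pair_mean)  # 2.0
```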

The output would be something akin to:

       column
index
a & a     2.5
a & b     2.0
a & c     2.5
b & b     1.5
b & c     2.0
c & c     2.5

Preferably this would involve manipulating the parameters of 'groupby', but as it is, I'm having to resort to looping and slicing. Ideally the approach would also be able to build all combinations of subgroups.

I've revisited this 3 years later with a general solution to this problem.

It's being used in this open source library, which is why I'm now able to share it here. It works with any number of indexes and creates combinations on them using numpy matrix broadcasting.
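As a rough illustration of the broadcasting idea (a sketch using the per-group sums of the question's data; the array name `totals` is hypothetical), adding a 1-D array to a column-shaped view of itself produces every pairwise sum in one step:

```python
import numpy as np

# Per-group sums for a, b, c in the question's data: 1+4, 2+1, 3+2
totals = np.array([5.0, 3.0, 5.0])

# Shape (3,) plus shape (3, 1) broadcasts to shape (3, 3):
# entry [i, j] is totals[i] + totals[j]
pairwise = totals + totals[:, np.newaxis]
print(pairwise)
```

The same trick applied to the per-group counts gives the denominators, so dividing one matrix by the other yields the pooled mean of every pair.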

So first of all, that DataFrame can't be unstacked as it stands, because its index isn't unique. Let's add another index level to that object and make it a Series:

df = pd.DataFrame({
    'unique': [1, 2, 3, 4, 5, 6], 
    'index': ['a','b','c','a','b','c'], 
    'column': [1,2,3,4,1,2]
}).set_index(['unique','index'])
s = df['column']

Let's unstack that index:

>>> idxs = ['index'] # set as variable to be used later on
>>> unstacked = s.unstack(idxs)
       column
index       a    b    c
unique
1         1.0  NaN  NaN
2         NaN  2.0  NaN
3         NaN  NaN  3.0
4         4.0  NaN  NaN
5         NaN  1.0  NaN
6         NaN  NaN  2.0
>>> vals = unstacked.values
array([[  1.,  nan,  nan],
       [ nan,   2.,  nan],
       [ nan,  nan,   3.],
       [  4.,  nan,  nan],
       [ nan,   1.,  nan],
       [ nan,  nan,   2.]])
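>>> # per-label totals and counts, ignoring the NaN padding
>>> total = np.nansum(vals, axis=0)
>>> count = (~np.isnan(vals)).sum(axis=0)
>>> # broadcasting a row against a column covers every pair (i, j)
>>> mean = (total + total[:, np.newaxis]) / (count + count[:, np.newaxis])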
array([[ 2.5,  2. ,  2.5],
       [ 2. ,  1.5,  2. ],
       [ 2.5,  2. ,  2.5]])

Now recreate the output dataframe:

>>> new_df = pd.DataFrame(mean, unstacked.columns, unstacked.columns.copy())
index_    a    b    c
index
a       2.5  2.0  2.5
b       2.0  1.5  2.0
c       2.5  2.0  2.5
>>> idxs_ = [ x+'_' for x in idxs ]
>>> new_df.columns.names = idxs_
>>> new_df.stack(idxs_, dropna=False)
index  index_
a      a         2.5
       b         2.0
       c         2.5
b      a         2.0
       b         1.5
       c         2.0
c      a         2.5
       b         2.0
       c         2.5
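The steps above can be collected into one function. This is only a sketch under some assumptions: the helper name `pairwise_pooled_mean` is hypothetical, it handles a single index level given by name, and it treats NaN in the unstacked matrix as "no observation".

```python
import numpy as np
import pandas as pd


def pairwise_pooled_mean(s, level):
    """Pooled mean for every ordered pair of labels in `level` (hypothetical helper)."""
    unstacked = s.unstack(level)
    vals = unstacked.to_numpy(dtype=float)
    totals = np.nansum(vals, axis=0)        # per-label sum
    counts = (~np.isnan(vals)).sum(axis=0)  # per-label number of rows
    # Broadcasting a row against a column gives the pooled sum and count per pair
    means = (totals + totals[:, np.newaxis]) / (counts + counts[:, np.newaxis])
    return pd.DataFrame(
        means,
        index=unstacked.columns,
        columns=unstacked.columns.rename(level + '_'),
    )


df = pd.DataFrame({
    'unique': [1, 2, 3, 4, 5, 6],
    'index': ['a', 'b', 'c', 'a', 'b', 'c'],
    'column': [1, 2, 3, 4, 1, 2],
}).set_index(['unique', 'index'])

result = pairwise_pooled_mean(df['column'], 'index')
print(result.loc['a', 'b'])  # 2.0
```

Because sums and counts are pooled separately, the result stays correct when the groups have different numbers of rows.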

My current implementation is:

import itertools

import numpy as np
import pandas as pd


def all_pairs(df, ix):
    # Collect one entry per unordered pair of distinct index labels
    pairs = {
        ix: [],
        'p': []
    }
    for subset in itertools.combinations(np.unique(np.array(df.index)), 2):
        pairs[ix].append(subset)
        # Mean of 'column' over all rows carrying either label of the pair
        pairs['p'].append(df.loc[list(subset), 'column'].mean())

    return pd.DataFrame(pairs).set_index(ix)

It gets the combinations, adds them to a dict, and then builds that back up into a data frame. It's hacky though :(

Here's an implementation that uses a MultiIndex and an outer join to handle the cross join.

import pandas as pd
import numpy as np

df = pd.DataFrame({'index': ['a','b','c','a','b','c'], 'column': [1,2,3,4,1,2]}).set_index(['index'])

groupedDF = df.groupby(df.index).mean()
# Create new MultiIndex using from_product, which gives a pairing of the elements in each iterable
p = pd.MultiIndex.from_product([groupedDF.index, groupedDF.index])
# Add a constant column to use as the cross join key
groupedDF[0] = 0
# Outer join on the constant key pairs every group with every group
groupedDF = pd.merge(groupedDF, groupedDF, how='outer', on=0).set_index(p)
# Get the mean for every row. Note this averages the two group means, which
# equals the pooled mean of the pair only when both groups have the same size.
# Unstack to get a matrix for deduplication
crossJoinMeans = groupedDF[['column_x', 'column_y']].mean(axis=1).unstack()
# Keep the diagonal and everything above it, so each unordered pair appears once
keep = np.triu(np.ones(crossJoinMeans.shape, dtype=bool))
# where() blanks the entries outside the mask; stack and drop the blanks
finalDF = crossJoinMeans.where(keep).stack().dropna()

I'd guess that this could be cleaned up and made more concise.
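One possible more concise variant, offered as a sketch rather than a definitive answer: aggregate each group to a sum and a count once, then pool those for every pair with `itertools.combinations_with_replacement`. The names `stats`, `pair_means`, and `result` are hypothetical.

```python
from itertools import combinations_with_replacement

import pandas as pd

df = pd.DataFrame(
    {'index': ['a', 'b', 'c', 'a', 'b', 'c'], 'column': [1, 2, 3, 4, 1, 2]}
).set_index('index')

# Per-group totals and row counts, so pairs can be pooled without re-slicing df
stats = df.groupby(level='index')['column'].agg(['sum', 'count'])

pair_means = {}
for left, right in combinations_with_replacement(stats.index, 2):
    # A label paired with itself is just that group's own mean
    labels = [left] if left == right else [left, right]
    pooled = stats.loc[labels].sum()
    pair_means[f'{left} & {right}'] = pooled['sum'] / pooled['count']

result = pd.Series(pair_means, name='column')
print(result)
```

This still loops over pairs, but only over the small aggregated table, and it gives the pooled mean even when groups have unequal sizes.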
