
Different groupers for each column with pandas GroupBy

How could I use a multidimensional Grouper, in this case another dataframe, as a Grouper for another dataframe? Can it be done in one step?

My question is essentially about how to perform the actual grouping under these circumstances, but to make it more specific, say I then want to transform and take the sum.

Consider, for example:

df1 = pd.DataFrame({'a':[1,2,3,4], 'b':[5,6,7,8]})

print(df1)
   a  b
0  1  5
1  2  6
2  3  7
3  4  8

df2  = pd.DataFrame({'a':['A','B','A','B'], 'b':['A','A','B','B']})

print(df2)
   a  b
0  A  A
1  B  A
2  A  B
3  B  B

Then, the expected output would be:

   a  b
0  4  11
1  6  11
2  4  15
3  6  15

Here, columns a and b in df1 have been grouped by columns a and b from df2, respectively.
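To make that mapping concrete, here is column a worked out by hand (an illustration only): rows 0 and 2 share group A (1 + 3 = 4), while rows 1 and 3 share group B (2 + 4 = 6).

print(df1['a'].groupby(df2['a']).transform('sum'))
0    4
1    6
2    4
3    6
Name: a, dtype: int64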

Try using apply to apply a lambda function to each column of your dataframe, then use the name of that pd.Series to select the matching grouper column from the second dataframe:

df1.apply(lambda x: x.groupby(df2[x.name]).transform('sum'))

Output:

   a   b
0  4  11
1  6  11
2  4  15
3  6  15

You will have to group each column individually, since each column uses a different grouping scheme.
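To see why the trick works: apply hands each column to the lambda as a pd.Series whose .name attribute is the column label, so df2[x.name] picks out the matching grouper column. A quick check (illustrative only):

x = df1['a']
print(x.name)       # a
print(df2[x.name])  # df2's column 'a', the grouper for df1's column 'a'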

If you want a cleaner version, I would recommend a list comprehension over the column names, calling pd.concat on the resulting series:

pd.concat([df1[c].groupby(df2[c]).transform('sum') for c in df1.columns], axis=1)

   a   b
0  4  11
1  6  11
2  4  15
3  6  15

Not to say there's anything wrong with using apply as in the other answer, just that I don't like apply, so this is my suggestion :-)


Here are some timeits for your perusal. Even on just your sample data, the difference in timings is obvious.

%%timeit
(df1.stack()
    .groupby([df2.stack().index.get_level_values(level=1), df2.stack()])
    .transform('sum').unstack())

%%timeit
df1.apply(lambda x: x.groupby(df2[x.name]).transform('sum'))

%%timeit
pd.concat([df1[c].groupby(df2[c]).transform('sum') for c in df1.columns], axis=1)

8.99 ms ± 4.55 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
8.35 ms ± 859 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
6.13 ms ± 279 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Not to say apply is slow, but explicit iteration is faster in this case. Additionally, the second and third timed solutions will scale better as length grows relative to breadth, since the number of Python-level iterations depends on the number of columns.
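As a rough sketch of that scaling claim (my own illustrative setup, not part of the timings above), a tall frame keeps the Python-level loop at one pass per column, no matter how many rows there are:

import numpy as np
import pandas as pd

# 100,000 rows but only 2 columns: the list comprehension iterates just twice
tall1 = pd.DataFrame(np.random.rand(100_000, 2), columns=['a', 'b'])
tall2 = pd.DataFrame(np.random.choice(list('ABCD'), (100_000, 2)), columns=['a', 'b'])

out = pd.concat([tall1[c].groupby(tall2[c]).transform('sum') for c in tall1.columns], axis=1)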

Using stack and unstack

df1.stack().groupby([df2.stack().index.get_level_values(level=1),df2.stack()]).transform('sum').unstack()
Out[291]: 
   a   b
0  4  11
1  6  11
2  4  15
3  6  15
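For reference, the two grouping keys here are the stacked frame's second index level (the original column name) and the stacked df2 values, so each (column, label) pair forms its own group. On the sample data, the stacked intermediate looks like:

print(df1.stack())
0  a    1
   b    5
1  a    2
   b    6
2  a    3
   b    7
3  a    4
   b    8
dtype: int64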

I'm going to propose a (mostly) numpythonic solution that uses a scipy.sparse matrix to perform a vectorized groupby on the entire DataFrame at once, rather than column by column.


The key to performing this operation efficiently is finding a performant way to factorize the entire DataFrame while avoiding duplicates across columns. Since your groups are represented by strings, you can simply concatenate the column name onto the end of each value (since column names are unique), and then factorize the result, like so [*]:

>>> df2 + df2.columns
    a   b
0  Aa  Ab
1  Ba  Ab
2  Aa  Bb
3  Ba  Bb

>>> pd.factorize((df2 + df2.columns).values.ravel())
(array([0, 1, 2, 1, 0, 3, 2, 3], dtype=int64),
 array(['Aa', 'Ab', 'Ba', 'Bb'], dtype=object))

Once we have a unique grouping, we can utilize a scipy.sparse matrix to perform the groupby in a single pass over the flattened array, then use advanced indexing and a reshaping operation to convert the result back to the original shape.

import numpy as np
from scipy import sparse

a = df1.values.ravel()
b, _ = pd.factorize((df2 + df2.columns).values.ravel())

# CSR matrix with exactly one entry per row (indptr = 0, 1, ..., n): value a[i]
# lands in column b[i], so summing over axis 0 yields the total per group code.
o = sparse.csr_matrix(
    (a, b, np.arange(a.shape[0] + 1)), (a.shape[0], b.max() + 1)
).sum(0).A1

res = o[b].reshape(df1.shape)
res

array([[ 4, 11],
       [ 6, 11],
       [ 4, 15],
       [ 6, 15]], dtype=int64)
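If you would rather avoid scipy, np.bincount with weights computes the same per-group sums in one pass (an alternative I'm adding for comparison, not part of the approach above; note that bincount returns float64):

o2 = np.bincount(b, weights=a)   # sum of a for each group code in b
res2 = o2[b].reshape(df1.shape)  # broadcast back and reshape, as before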

Performance

Functions

def gp_chris(f1, f2):
    a = f1.values.ravel()
    b, _ = pd.factorize((f2 + f2.columns).values.ravel())

    o = sparse.csr_matrix(
        (a, b, np.arange(a.shape[0] + 1)), (a.shape[0], b.max() + 1)
    ).sum(0).A1

    return pd.DataFrame(o[b].reshape(f1.shape), columns=f1.columns)


def gp_cs(f1, f2):
    return pd.concat([f1[c].groupby(f2[c]).transform('sum') for c in f1.columns], axis=1)


def gp_scott(f1, f2):
    return f1.apply(lambda x: x.groupby(f2[x.name]).transform('sum'))


def gp_wen(f1, f2):
    return f1.stack().groupby([f2.stack().index.get_level_values(level=1), f2.stack()]).transform('sum').unstack()

Setup

import numpy as np
from scipy import sparse
import pandas as pd
import string
from timeit import timeit
import matplotlib.pyplot as plt
res = pd.DataFrame(
       index=[f'gp_{f}' for f in ('chris', 'cs', 'scott', 'wen')],
       columns=[10, 50, 100, 200, 400],
       dtype=float
)

for f in res.index:
    for c in res.columns:
        df1 = pd.DataFrame(np.random.rand(c, c))
        df2 = pd.DataFrame(np.random.choice(list(string.ascii_uppercase), (c, c)))
        df1.columns = df1.columns.astype(str)
        df2.columns = df2.columns.astype(str)

        stmt = '{}(df1, df2)'.format(f)
        setp = 'from __main__ import df1, df2, {}'.format(f)
        res.at[f, c] = timeit(stmt, setp, number=50)


ax = res.div(res.min()).T.plot(loglog=True)
ax.set_xlabel("N")
ax.set_ylabel("time (relative)")

plt.show()

Results

[Figure: benchmark results, relative time vs. N on a log-log scale]


Validation

df1 = pd.DataFrame(np.random.rand(10, 10))
df2 = pd.DataFrame(np.random.choice(list(string.ascii_uppercase), (10, 10)))
df1.columns = df1.columns.astype(str)
df2.columns = df2.columns.astype(str)

v = np.stack([gp_chris(df1, df2), gp_cs(df1, df2), gp_scott(df1, df2), gp_wen(df1, df2)])
print(np.all(v[:-1] == v[1:]))

True

Either we're all wrong or we're all correct :)


[*] There is a possibility that you could get a duplicate value here if, before the concatenation occurs, one item happens to equal the concatenation of another item and a column name. However, if this is the case, you shouldn't need to adjust much to fix it.
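One possible guard if you are worried about such a collision (my suggestion, not part of the answer above): join each value and its column name with a separator that is assumed never to occur in the data before factorizing.

# '\x00' is assumed absent from the data; any never-occurring separator works
keys = (df2 + '\x00' + df2.columns).values.ravel()
b, _ = pd.factorize(keys)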

You could do something like the following:

res = df1.assign(a=lambda df: df['a'].groupby(df2['a']).transform('sum'))\
         .assign(b=lambda df: df['b'].groupby(df2['b']).transform('sum'))

Results:

   a   b
0  4  11
1  6  11
2  4  15
3  6  15
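If there were more columns, the same idea generalizes with a dict comprehension (a sketch, assuming every column gets the same treatment):

res = df1.assign(**{c: df1[c].groupby(df2[c]).transform('sum') for c in df1.columns})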


 