
Different groupers for each column with pandas GroupBy

How could I use a multidimensional Grouper, in this case another dataframe, as a Grouper for another dataframe? Can it be done in one step?

My question is essentially about how to perform the actual grouping under these circumstances; to make it concrete, say I then want to transform and take the sum.

Consider for example:

import pandas as pd

df1 = pd.DataFrame({'a':[1,2,3,4], 'b':[5,6,7,8]})

print(df1)
   a  b
0  1  5
1  2  6
2  3  7
3  4  8

df2 = pd.DataFrame({'a':['A','B','A','B'], 'b':['A','A','B','B']})

print(df2)
   a  b
0  A  A
1  B  A
2  A  B
3  B  B

Then, the expected output would be:

   a  b
0  4  11
1  6  11
2  4  15
3  6  15

Where columns a and b in df1 have been grouped by columns a and b from df2 respectively.
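
In other words, I am looking for a one-step equivalent of this explicit loop (a minimal sketch shown only to pin down the intended semantics):

out = df1.copy()
for c in df1.columns:
    # Group column c of df1 by the labels in column c of df2,
    # then broadcast each group's sum back to the original rows.
    out[c] = df1[c].groupby(df2[c]).transform('sum')
print(out)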

Try using apply to apply a lambda function to each column of your dataframe, then use the name of each resulting pd.Series to pick the matching column of the second dataframe as the grouper:

df1.apply(lambda x: x.groupby(df2[x.name]).transform('sum'))

Output:

   a   b
0  4  11
1  6  11
2  4  15
3  6  15

You will have to group each column individually since each column uses a different grouping scheme.
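
To see why, note that a single grouper applies to every column at once. For example, grouping the whole frame by df2['a'] alone sums column b by the wrong labels:

print(df1.groupby(df2['a']).transform('sum'))

   a   b
0  4  12
1  6  14
2  4  12
3  6  14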

If you want a cleaner version, I would recommend a list comprehension over the column names, calling pd.concat on the resulting Series:

pd.concat([df1[c].groupby(df2[c]).transform('sum') for c in df1.columns], axis=1)

   a   b
0  4  11
1  6  11
2  4  15
3  6  15

Not to say there's anything wrong with using apply as in the other answer, just that I don't like apply, so this is my suggestion :-)


Here are some timeits for your perusal. Even on your small sample data, the difference in timings is noticeable.

%%timeit
(df1.stack()
    .groupby([df2.stack().index.get_level_values(level=1), df2.stack()])
    .transform('sum').unstack())
8.99 ms ± 4.55 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
df1.apply(lambda x: x.groupby(df2[x.name]).transform('sum'))
8.35 ms ± 859 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
pd.concat([df1[c].groupby(df2[c]).transform('sum') for c in df1.columns], axis=1)
6.13 ms ± 279 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Not to say apply is slow, but explicit iteration is faster in this case. Additionally, the second and third timed solutions will scale better with length than with breadth, since the number of iterations depends on the number of columns rather than the number of rows.

Using stack and unstack

df1.stack().groupby([df2.stack().index.get_level_values(level=1),df2.stack()]).transform('sum').unstack()
Out[291]: 
   a   b
0  4  11
1  6  11
2  4  15
3  6  15
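
To see how this works, inspect the stacked grouper. stack() flattens each frame into a Series with a (row, column) MultiIndex, and level 1 of that index carries the column names, so an 'A' from column a ends up in a different group than an 'A' from column b:

print(df2.stack())

0  a    A
   b    A
1  a    B
   b    A
2  a    A
   b    B
3  a    B
   b    B
dtype: object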

I'm going to propose a (mostly) numpythonic solution that uses a scipy.sparse matrix to perform a vectorized groupby on the entire DataFrame at once, rather than column by column.


The key to performing this operation efficiently is finding a performant way to factorize the entire DataFrame while keeping identical labels from different columns in separate groups. Since your groups are represented by strings, you can simply concatenate the column name onto the end of each value (since column names are unique), and then factorize the result, like so [*]:

>>> df2 + df2.columns
    a   b
0  Aa  Ab
1  Ba  Ab
2  Aa  Bb
3  Ba  Bb

>>> pd.factorize((df2 + df2.columns).values.ravel())
(array([0, 1, 2, 1, 0, 3, 2, 3], dtype=int64),
 array(['Aa', 'Ab', 'Ba', 'Bb'], dtype=object))

Once we have a unique grouping, we can use a scipy.sparse matrix to perform the groupby in a single pass over the flattened arrays, then use advanced indexing and a reshape to convert the result back to the original shape.

import numpy as np
from scipy import sparse

a = df1.values.ravel()                                    # flattened values
b, _ = pd.factorize((df2 + df2.columns).values.ravel())   # integer group ids

# Build a CSR matrix with exactly one entry per row: data=a, column index=b.
# Summing over axis 0 then collapses the rows into one sum per group.
o = sparse.csr_matrix(
    (a, b, np.arange(a.shape[0] + 1)), (a.shape[0], b.max() + 1)
).sum(0).A1

# Broadcast each group's sum back to its members and restore the shape.
res = o[b].reshape(df1.shape)

array([[ 4, 11],
       [ 6, 11],
       [ 4, 15],
       [ 6, 15]], dtype=int64)

Performance

Functions

def gp_chris(f1, f2):
    a = f1.values.ravel()
    b, _ = pd.factorize((f2 + f2.columns).values.ravel())

    o = sparse.csr_matrix(
        (a, b, np.arange(a.shape[0] + 1)), (a.shape[0], b.max() + 1)
    ).sum(0).A1

    return pd.DataFrame(o[b].reshape(f1.shape), columns=f1.columns)


def gp_cs(f1, f2):
    return pd.concat([f1[c].groupby(f2[c]).transform('sum') for c in f1.columns], axis=1)


def gp_scott(f1, f2):
    return f1.apply(lambda x: x.groupby(f2[x.name]).transform('sum'))


def gp_wen(f1, f2):
    return f1.stack().groupby([f2.stack().index.get_level_values(level=1), f2.stack()]).transform('sum').unstack()

Setup

import numpy as np
from scipy import sparse
import pandas as pd
import string
from timeit import timeit
import matplotlib.pyplot as plt
res = pd.DataFrame(
       index=[f'gp_{f}' for f in ('chris', 'cs', 'scott', 'wen')],
       columns=[10, 50, 100, 200, 400],
       dtype=float
)

for f in res.index:
    for c in res.columns:
        df1 = pd.DataFrame(np.random.rand(c, c))
        df2 = pd.DataFrame(np.random.choice(list(string.ascii_uppercase), (c, c)))
        df1.columns = df1.columns.astype(str)
        df2.columns = df2.columns.astype(str)

        stmt = '{}(df1, df2)'.format(f)
        setp = 'from __main__ import df1, df2, {}'.format(f)
        res.at[f, c] = timeit(stmt, setp, number=50)


ax = res.div(res.min()).T.plot(loglog=True)
ax.set_xlabel("N")
ax.set_ylabel("time (relative)")

plt.show()

Results

[Image: log-log plot of time (relative) vs. N for gp_chris, gp_cs, gp_scott and gp_wen]


Validation

df1 = pd.DataFrame(np.random.rand(10, 10))
df2 = pd.DataFrame(np.random.choice(list(string.ascii_uppercase), (10, 10)))
df1.columns = df1.columns.astype(str)
df2.columns = df2.columns.astype(str)

v = np.stack([gp_chris(df1, df2), gp_cs(df1, df2), gp_scott(df1, df2), gp_wen(df1, df2)])
print(np.all(v[:-1] == v[1:]))

True

Either we're all wrong or we're all correct :)
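
One caveat worth adding: the values here are floats, and floating-point sums can differ in the last bits depending on summation order, so a tolerance-based comparison is the safer check:

print(np.allclose(v[:-1], v[1:]))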


[*] There is a possibility that you could get a duplicate value here if one item happens to equal another item's value concatenated with its column name. However, if this is the case, you shouldn't need to adjust much to fix it.
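
For example, one possible fix (a sketch, assuming the NUL character never occurs in your data) is to insert a separator before the column name so no concatenation can collide:

codes, uniques = pd.factorize((df2 + '\0' + df2.columns).values.ravel())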

You could do something like the following:

res = df1.assign(a=lambda df: df['a'].groupby(df2['a']).transform('sum'))\
         .assign(b=lambda df: df['b'].groupby(df2['b']).transform('sum'))

Results:

   a   b
0  4  11
1  6  11
2  4  15
3  6  15
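
With more columns, the same pattern generalizes via dictionary unpacking (a sketch assuming every column of df1 has a same-named grouper column in df2):

res = df1.assign(**{c: df1[c].groupby(df2[c]).transform('sum') for c in df1.columns})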
