
Grouping by multiple criteria in pandas

I have a pandas DataFrame like this:

>>> df
        Benny  Daniel   Doris   Eric   Jack    Zoe
Age        75      30      95     25     28     23
Salary   2000    9000  100000  10000  12000  20000 

I would like to find the mean age and salary for several different groups. Each group is a subset of the columns, and the groups may overlap, as in this dictionary:

{'Parrot lovers': ['Doris', 'Benny'], 'Tea Drinkers': ['Doris', 'Zoe'],\
 'Maintainance': ['Benny', 'Jack'], 'Coffee Drinkers': ['Benny', 'Eric'],\
 'Senior Management': ['Doris', 'Zoe', 'Jack']}

How can I design a groupby function that will do this?

Here is how I set up the problem...

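from io import StringIO  # Python 3; the Python 2 StringIO module no longer exists
import pandas as pd

df = """index  Benny  Daniel   Doris   Eric   Jack    Zoe
Age        75      30      95     25     28     23
Salary   2000    9000  100000  10000  12000  20000"""
# Parse the whitespace-separated text; the first column becomes the row index
df = pd.read_csv(StringIO(df), sep=r"\s+").set_index('index')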
d = {'Parrot lovers': ['Doris', 'Benny'], 'Tea Drinkers': ['Doris', 'Zoe'],\
 'Maintainance': ['Benny', 'Jack'], 'Coffee Drinkers': ['Benny', 'Eric'],\
 'Senior Management': ['Doris', 'Zoe', 'Jack']}

For the solution, just use .loc and iterate through the dictionary...

# One Series of row means (Age, Salary) per group, taken over that group's columns
averages = {k: df.loc[:, v].mean(axis=1) for k, v in d.items()}
print(pd.DataFrame(averages).T)  # gives the nice printout...

index                    Age  Salary
Coffee Drinkers    50.000000    6000
Maintainance       51.500000    7000
Parrot lovers      85.000000   51000
Senior Management  48.666667   44000
Tea Drinkers       59.000000   60000
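The same idea can be wrapped in a small helper so it is reusable for other group dictionaries. This is only a sketch: the name `group_means` is hypothetical, and it assumes every name listed in a group exists as a column of the frame (recent pandas versions raise a KeyError from .loc otherwise).

```python
from io import StringIO

import pandas as pd

raw = """index  Benny  Daniel   Doris   Eric   Jack    Zoe
Age        75      30      95     25     28     23
Salary   2000    9000  100000  10000  12000  20000"""
df = pd.read_csv(StringIO(raw), sep=r"\s+").set_index("index")

d = {'Parrot lovers': ['Doris', 'Benny'], 'Tea Drinkers': ['Doris', 'Zoe'],
     'Maintainance': ['Benny', 'Jack'], 'Coffee Drinkers': ['Benny', 'Eric'],
     'Senior Management': ['Doris', 'Zoe', 'Jack']}


def group_means(frame, groups):
    """Hypothetical helper: one row per group, one column per row of `frame`.

    For each group, select that group's columns with .loc and average
    across them (axis=1), then stack the per-group results as rows.
    """
    per_group = {name: frame.loc[:, members].mean(axis=1)
                 for name, members in groups.items()}
    return pd.DataFrame(per_group).T


result = group_means(df, d)
print(result)
```

Because each group is looked up independently, overlapping membership (Doris is in three groups here) needs no special handling.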

There are probably a handful of ways to do this; here's one path.

Transpose your data, and add a True/False column for each category:

In [20]: group_map = {'Parrot lovers': ['Doris', 'Benny'], 
                      'Tea Drinkers': ['Doris', 'Zoe'],
                      'Maintainance': ['Benny', 'Jack'], 
                      'Coffee Drinkers': ['Benny', 'Eric'], 
                      'Senior Management': ['Doris', 'Zoe', 'Jack']}
In [22]: df = df.T
In [23]: for k in group_map:
    ...:     df[k] = df.index.isin(group_map[k])

Now you can group by any category column to get the means:

In [24]: df.groupby('Parrot lovers')['Salary'].mean()
Out[24]: 
Parrot lovers
False            12750
True             51000
Name: Salary, dtype: int64

Or, iterate over the columns to get the mean for each category.

In [24]: means = {}
    ...: for k in group_map:
    ...:     means[k] = df.groupby(k)['Salary'].mean()[True]
    ...: means
    ...: 
Out[24]: 
{'Coffee Drinkers': 6000,
 'Maintainance': 7000,
 'Parrot lovers': 51000,
 'Senior Management': 44000,
 'Tea Drinkers': 60000}
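Another possible path, not taken from the answers above, is to reshape the membership dictionary into a long table with one row per (group, person) pair and then run a single groupby. The sketch below assumes the same data and dictionary as in the question; the names `people`, `membership`, `long_form`, and `table` are made up for illustration.

```python
from io import StringIO

import pandas as pd

raw = """index  Benny  Daniel   Doris   Eric   Jack    Zoe
Age        75      30      95     25     28     23
Salary   2000    9000  100000  10000  12000  20000"""
# Transpose so that each person is a row with Age and Salary columns
people = pd.read_csv(StringIO(raw), sep=r"\s+").set_index("index").T

group_map = {'Parrot lovers': ['Doris', 'Benny'],
             'Tea Drinkers': ['Doris', 'Zoe'],
             'Maintainance': ['Benny', 'Jack'],
             'Coffee Drinkers': ['Benny', 'Eric'],
             'Senior Management': ['Doris', 'Zoe', 'Jack']}

# One row per (group, person) pair; a person who belongs to several
# groups simply appears several times
membership = pd.DataFrame(
    [(group, person)
     for group, members in group_map.items()
     for person in members],
    columns=["group", "person"],
)

# Attach each person's Age and Salary, then average within each group
long_form = membership.join(people, on="person")
table = long_form.groupby("group")[["Age", "Salary"]].mean()
print(table)
```

This yields both the Age and Salary means for every group in one pass, and people who belong to no group (Daniel here) drop out because they never appear in the membership table.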
