pandas groupby when one record belongs to more than one group

I would like to be able to produce summary statistics and pivot tables from a dataset that can be grouped in multiple ways. The complication arises because each entry can belong to more than one group within one categorisation axis (see example below).

So far, I have found a solution based on multi-indexing and repeating each record as many times as it appears in category1*category2 combinations. However, this seems to be inflexible (I will need to check whether an entry appears in the same categories across different data sources, I might want to add another category system that would be called category3, a category "d" might get added to the category1 system, etc.). Moreover, it seems to go against basic principles of database design.

My question is: is there a more elegant, flexible way to solve this problem than my solution below? I could imagine keeping several tables, one with the actual data and others with the grouping information (much like the stack table below), and using these flexibly as input to groupby, but I don't know whether that is possible or how to make it work. Any suggestions for improvements are also welcome. Thanks!

The raw data looks something like this:

import pandas

# Each record can list several comma-separated groups per categorisation axis
data = {'ID': [1, 2, 3, 4],
        'year': [2004, 2008, 2006, 2009],
        'money': [10000, 5000, 4000, 11500],
        'categories1': ["a,b,c", "c", "a,c", ""],
        'categories2': ["one, two", "one", "five", "eight"]}
df = pandas.DataFrame(data)
df.set_index('ID', inplace=True)
print(df)

Which gives:

   categories1 categories2  money  year
ID                                     
1        a,b,c    one, two  10000  2004
2            c         one   5000  2008
3          a,c        five   4000  2006
4                    eight  11500  2009

I want to be able to make pivot tables that look like this:

Average money
year      2004   2005   2006   2007
category  
a         
b
c

and also:

Average money
category2      one  two   three    four   
category1  
a         
b
c

So far, I have:

Step 1: extracted the category information using get_dummies:

cat1 = df['categories1'].str.get_dummies(sep=",")
print(cat1)

Which gives:

    a  b  c
ID         
1   1  1  1
2   0  0  1
3   1  0  1
4   0  0  0

Step 2: stacked this:

stack = cat1.stack()
stack.index.names = ['ID', 'cat1']
stack.name = 'in_cat1'
print(stack)

Which gives:

ID  cat1
1   a       1
    b       1
    c       1
2   a       0
    b       0
    c       1
3   a       1
    b       0
    c       1
4   a       0
    b       0
    c       0
Name: in_cat1, dtype: int64

Step 3: joined that onto the original data frame to create a multi-indexed data frame:

dl = df.join(stack, how='inner')
print(dl)

Which looks like this:

        categories1 categories2  money  year  in_cat1
ID cat1                                              
1  a          a,b,c    one, two  10000  2004        1
   b          a,b,c    one, two  10000  2004        1
   c          a,b,c    one, two  10000  2004        1
2  a              c         one   5000  2008        0
   b              c         one   5000  2008        0
   c              c         one   5000  2008        1
3  a            a,c        five   4000  2006        1
   b            a,c        five   4000  2006        0
   c            a,c        five   4000  2006        1
4  a                      eight  11500  2009        0
   b                      eight  11500  2009        0
   c                      eight  11500  2009        0

Step 4: the result is then usable with the pandas groupby and pivot_table commands:

dl.reset_index(level=1, inplace=True)
pt = dl.pivot_table(values='money', columns='year', index='cat1')
print(pt)

and does what I want:

year   2004  2006  2008   2009
cat1                          
a     10000  4000  5000  11500
b     10000  4000  5000  11500
c     10000  4000  5000  11500
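
One caveat about the table above: every category shows identical values because the join in Step 3 also keeps the rows where in_cat1 is 0, so non-members are averaged in too. A minimal sketch of filtering to actual members before pivoting follows. It rebuilds the same frame so that it runs on its own, assumes Python 3 syntax, and the names members and pt_members are only illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'year': [2004, 2008, 2006, 2009],
    'money': [10000, 5000, 4000, 11500],
    'categories1': ["a,b,c", "c", "a,c", ""],
}).set_index('ID')

# Steps 1-3 from above: indicator columns, stacked, joined back onto the data
stack = df['categories1'].str.get_dummies(sep=",").stack()
stack.index.names = ['ID', 'cat1']
stack.name = 'in_cat1'
dl = df.join(stack, how='inner').reset_index(level=1)

# Keep only the (ID, cat1) pairs where the record really belongs to the category
members = dl[dl['in_cat1'] == 1]
pt_members = members.pivot_table(values='money', columns='year', index='cat1')
print(pt_members)
```

With this filter, a record such as ID 4, which has no categories1 entry, drops out of the pivot entirely.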

I have repeated steps 2 and 3 with categories2, so that the data frame now has a 3-level index.
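
When repeating the steps for categories2, note that "one, two" has a space after the comma, so splitting on "," alone would produce a category named " two". The sketch below wraps steps 1 and 2 in a hypothetical helper, stacked_membership (not part of pandas), that normalises the whitespace first:

```python
import pandas as pd

def stacked_membership(df, col, level_name):
    """Hypothetical helper: long 0/1 indicator Series for one category column."""
    # Remove whitespace around the separators so "one, two" yields "one" and "two"
    cleaned = df[col].str.replace(r'\s*,\s*', ',', regex=True).str.strip()
    stacked = cleaned.str.get_dummies(sep=',').stack()
    stacked.index.names = [df.index.name, level_name]
    stacked.name = 'in_' + level_name
    return stacked

df = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'categories2': ["one, two", "one", "five", "eight"],
}).set_index('ID')

cat2_stack = stacked_membership(df, 'categories2', 'cat2')
# Show only the (ID, cat2) pairs that are actual memberships
print(cat2_stack[cat2_stack == 1])
```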

I created a function that takes a DataFrame and a column name. The named column is expected to hold strings that can be split on ','. The function appends the split values to the index as a new level named after the column.

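import pandas as pd

def expand_and_add(df, col):
    # Repeat each row once per comma-separated value in `col`, keyed by that value
    expand = lambda x: pd.concat([x for i in x[col].split(',')], keys=x[col].split(','))
    df = df.apply(expand, axis=1).stack(0)
    # Name the newly appended index level after the source column
    df.index = df.index.set_names(col, level=-1)
    df = df.drop(col, axis=1)
    return df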

This helps create the three levels of the MultiIndex. I believe manipulating the MultiIndex gives you all the flexibility you need to create the pivots you want.

new_df = expand_and_add(expand_and_add(df, 'categories1'), 'categories2')

The result looks like:

                              money    year
ID categories1 categories2                 
1  a            two         10000.0  2004.0
               one          10000.0  2004.0
   b            two         10000.0  2004.0
               one          10000.0  2004.0
   c            two         10000.0  2004.0
               one          10000.0  2004.0
2  c           one           5000.0  2008.0
3  a           five          4000.0  2006.0
   c           five          4000.0  2006.0
4              eight        11500.0  2009.0

Each pivot is still going to be a bit messy to write, but here are some examples.

mean [categories1, year]

new_df.set_index(new_df.year.astype(int), append=True)['money'].groupby(level=[1, 3]).mean().unstack()

year            2004    2006    2008     2009
categories1                                  
                 NaN     NaN     NaN  11500.0
a            10000.0  4000.0     NaN      NaN
b            10000.0     NaN     NaN      NaN
c            10000.0  4000.0  5000.0      NaN

mean [categories1, categories2]

new_df.groupby(level=[1, 2])['money'].mean().unstack()

categories2      two    eight    five      one
categories1                                   
                 NaN  11500.0     NaN      NaN
a            10000.0      NaN  4000.0  10000.0
b            10000.0      NaN     NaN  10000.0
c            10000.0      NaN  4000.0   7500.0
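
The separate-tables idea raised in the question (one table with the data, others holding only the grouping information) can also be sketched directly. This is one possible layout rather than an established recipe: membership_table is a hypothetical helper, and it assumes a pandas version that provides DataFrame.explode (0.25 or later).

```python
import pandas as pd

data = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'year': [2004, 2008, 2006, 2009],
    'money': [10000, 5000, 4000, 11500],
})
raw_groups = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'categories1': ["a,b,c", "c", "a,c", ""],
    'categories2': ["one, two", "one", "five", "eight"],
})

def membership_table(groups, col):
    """Hypothetical helper: one (ID, category) row per group a record belongs to."""
    pairs = groups[['ID', col]].copy()
    pairs[col] = pairs[col].str.split(',')
    pairs = pairs.explode(col)
    pairs[col] = pairs[col].str.strip()
    # Records with an empty category string belong to no group on this axis
    return pairs[pairs[col] != ''].reset_index(drop=True)

cat1 = membership_table(raw_groups, 'categories1')
cat2 = membership_table(raw_groups, 'categories2')

# Mean money by categories1 and year
by_year = data.merge(cat1, on='ID').pivot_table(
    values='money', index='categories1', columns='year')

# Mean money by categories1 and categories2
by_cat2 = data.merge(cat1, on='ID').merge(cat2, on='ID').pivot_table(
    values='money', index='categories1', columns='categories2')

print(by_year)
print(by_cat2)
```

With this layout the data table is never duplicated in storage. A further categorisation axis is just one more membership table, and a record is repeated only inside the merge that a particular pivot needs.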
