I would like to be able to produce summary statistics and pivot tables from a dataset that can be grouped in multiple ways. The complication arises because each entry can belong to more than one group within one categorisation axis (see example below).
So far, I have found a solution based on multi-indexing and repeating each record as many times as it appears in category1*category2 combinations. However, this seems to be inflexible (I will need to check whether an entry appears in the same categories across different data sources, I might want to add another category system that would be called category3, a category "d" might get added to the category1 system, etc.). Moreover, it seems to go against basic principles of database design.
My question is: is there a more elegant and flexible way to solve this problem than my solution below? I could imagine keeping several tables, one with the actual data and others with the grouping information (much like the stack table below), and using these flexibly as input to groupby, but I don't know whether that is possible or how to make it work. Any suggestions for improvements are also welcome. Thanks!
The raw data comes as something like:
import pandas
data = {'ID': [1, 2, 3, 4],
        'year': [2004, 2008, 2006, 2009],
        'money': [10000, 5000, 4000, 11500],
        'categories1': ["a,b,c", "c", "a,c", ""],
        'categories2': ["one, two", "one", "five", "eight"]}
df = pandas.DataFrame(data)
df.set_index('ID', inplace=True)
print(df)
Which gives:
   categories1 categories2  money  year
ID
1        a,b,c    one, two  10000  2004
2            c         one   5000  2008
3          a,c        five   4000  2006
4                    eight  11500  2009
I want to be able to make pivot tables that look like this:
Average money
year 2004 2005 2006 2007
category
a
b
c
and also:
Average money
category2 one two three four
category1
a
b
c
So far, I have:
Step 1: extracted the category information using get_dummies:
cat1 = df['categories1'].str.get_dummies(sep=",")
print(cat1)
Which gives:
a b c
ID
1 1 1 1
2 0 0 1
3 1 0 1
4 0 0 0
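One detail worth checking at this step: with sep="," the pieces keep any surrounding whitespace, so "one, two" in categories2 would produce a column named " two" (with a leading space). A minimal sketch, assuming it is acceptable to remove all spaces from the labels before splitting; the variable names `raw` and `clean` are only illustrative:

```python
import pandas as pd

categories2 = pd.Series(
    ["one, two", "one", "five", "eight"],
    index=pd.Index([1, 2, 3, 4], name="ID"),
)

# Splitting on "," alone keeps the leading space of " two".
raw = categories2.str.get_dummies(sep=",")

# Removing the spaces first gives clean labels (assumes labels contain no meaningful spaces).
clean = categories2.str.replace(" ", "", regex=False).str.get_dummies(sep=",")

print(list(raw.columns))
print(list(clean.columns))
```

If the labels can contain meaningful spaces, stripping only the ends of each piece would be the safer choice.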
Step 2: stacked this:
stack = cat1.stack()
stack.index.names=['ID', 'cat1']
stack.name='in_cat1'
print(stack)
Which gives:
ID cat1
1 a 1
b 1
c 1
2 a 0
b 0
c 1
3 a 1
b 0
c 1
4 a 0
b 0
c 0
Name: in_cat1, dtype: int64
Step 3: joined that onto the original data frame to create a multi-indexed data frame:
dl = df.join(stack, how='inner')
print(dl)
Which looks like this:
        categories1 categories2  money  year  in_cat1
ID cat1
1  a          a,b,c    one, two  10000  2004        1
   b          a,b,c    one, two  10000  2004        1
   c          a,b,c    one, two  10000  2004        1
2  a              c         one   5000  2008        0
   b              c         one   5000  2008        0
   c              c         one   5000  2008        1
3  a            a,c        five   4000  2006        1
   b            a,c        five   4000  2006        0
   c            a,c        five   4000  2006        1
4  a                      eight  11500  2009        0
   b                      eight  11500  2009        0
   c                      eight  11500  2009        0
Step 4: the result is then usable with the pandas groupby and pivot_table commands:
dl.reset_index(level=1, inplace=True)
pt = dl.pivot_table(values='money', columns='year', index='cat1')
print(pt)
and does what I want:
year 2004 2006 2008 2009
cat1
a 10000 4000 5000 11500
b 10000 4000 5000 11500
c 10000 4000 5000 11500
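Note that the table above still includes the rows where in_cat1 is 0, which is why every category shows a value for every year. A minimal sketch of one way to restrict the pivot to actual memberships, assuming the same data and names as in steps 1 to 4 (the name `members` is introduced here only for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "year": [2004, 2008, 2006, 2009],
    "money": [10000, 5000, 4000, 11500],
    "categories1": ["a,b,c", "c", "a,c", ""],
    "categories2": ["one, two", "one", "five", "eight"],
}).set_index("ID")

# Steps 1 and 2 as above: indicator columns, then a long (ID, cat1) series.
cat1 = df["categories1"].str.get_dummies(sep=",")
stack = cat1.stack()
stack.index.names = ["ID", "cat1"]
stack.name = "in_cat1"

# Keep only the (ID, cat1) pairs that are real memberships,
# and turn the cat1 level into an ordinary column.
members = stack[stack.astype(bool)].reset_index(level="cat1")

# Join on the shared ID index, then pivot on the membership column.
dl = df.join(members, how="inner")
pt = dl.pivot_table(values="money", index="cat1", columns="year", aggfunc="mean")
print(pt)
```

With this filter, an entry that belongs to no category (ID 4 here) drops out of the pivot entirely, which may or may not be what is wanted.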
I have repeated steps 2 + 3 with category2, so that now the dataframe has 3-level indexing.
I created a function that takes a DataFrame and a column name. It's expected that the column specified by the column name has a string that can be split by ','. It will append this split to the index with the appropriate name.
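import pandas as pd

def expand_and_add(df, col):
    # Repeat each row once per comma-separated value, keyed by that value
    expand = lambda x: pd.concat([x for i in x[col].split(',')], keys=x[col].split(','))
    # Move the keys from the columns into a new innermost index level
    df = df.apply(expand, axis=1).stack(0)
    # Name the new index level after the source column
    df.index = df.index.set_names(col, level=-1)
    # The original string column is now redundant
    return df.drop(col, axis=1)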
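On pandas 0.25 or later, a similar expansion can be sketched with `str.split` and `DataFrame.explode`. This is an alternative to the function above rather than a drop-in replacement: it keeps the categories as ordinary columns instead of index levels, and the helper name `explode_categories` and the whitespace stripping are assumptions made for this example.

```python
import pandas as pd

def explode_categories(frame, col, sep=","):
    """Return a copy of `frame` with one row per value in the delimited column `col`."""
    out = frame.copy()
    # Split "a,b,c" into ["a", "b", "c"]; explode then gives one row per list item.
    out[col] = out[col].str.split(sep)
    out = out.explode(col).reset_index(drop=True)
    # Strip stray whitespace such as the space after the comma in "one, two".
    out[col] = out[col].str.strip()
    return out

# ID is kept as a regular column so that repeated rows stay easy to track.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "year": [2004, 2008, 2006, 2009],
    "money": [10000, 5000, 4000, 11500],
    "categories1": ["a,b,c", "c", "a,c", ""],
    "categories2": ["one, two", "one", "five", "eight"],
})

long_df = explode_categories(explode_categories(df, "categories1"), "categories2")
print(long_df)
```

Because the categories stay as columns, they can be passed by name to `groupby` or `pivot_table` without tracking index level positions.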
Now this will help create the 3 layers of MultiIndex. I do believe manipulating the MultiIndex provides all the flexibility you need to create the pivots you want.
new_df = expand_and_add(expand_and_add(df, 'categories1'), 'categories2')
Looks like:
                              money    year
ID categories1 categories2
1  a            two         10000.0  2004.0
               one          10000.0  2004.0
   b            two         10000.0  2004.0
               one          10000.0  2004.0
   c            two         10000.0  2004.0
               one          10000.0  2004.0
2  c           one           5000.0  2008.0
3  a           five          4000.0  2006.0
   c           five          4000.0  2006.0
4              eight        11500.0  2009.0
Your pivots are still going to be individually messy but here are some.
new_df.set_index(new_df.year.astype(int), append=True)['money'].groupby(level=[1, 3]).mean().unstack()
year            2004    2006    2008     2009
categories1
                 NaN     NaN     NaN  11500.0
a            10000.0  4000.0     NaN      NaN
b            10000.0     NaN     NaN      NaN
c            10000.0  4000.0  5000.0      NaN
new_df.groupby(level=[1, 2])['money'].mean().unstack()
categories2      two    eight    five      one
categories1
                 NaN  11500.0     NaN      NaN
a            10000.0      NaN  4000.0  10000.0
b            10000.0      NaN     NaN  10000.0
c            10000.0      NaN  4000.0   7500.0
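The idea raised in the question, keeping the data in one table and each category system in its own membership table, can also be sketched directly. The example below is only an illustration under stated assumptions: the names `membership_table` and `pivot_mean` are hypothetical, labels are whitespace-stripped, and entries with an empty category string are dropped. Merging only the membership tables a given pivot needs avoids repeating records for category systems that the pivot does not use.

```python
import pandas as pd

raw = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "year": [2004, 2008, 2006, 2009],
    "money": [10000, 5000, 4000, 11500],
    "categories1": ["a,b,c", "c", "a,c", ""],
    "categories2": ["one, two", "one", "five", "eight"],
})

# The actual data, one row per ID.
data = raw[["ID", "year", "money"]]

def membership_table(frame, col, name, sep=","):
    """Build a long (ID, name) table from a delimited string column."""
    members = frame[["ID", col]].copy()
    members[col] = members[col].str.split(sep)
    members = members.explode(col).reset_index(drop=True)
    members[col] = members[col].str.strip()
    # Assumption: an empty string means "belongs to no category".
    members = members[members[col] != ""]
    return members.rename(columns={col: name})

cat1 = membership_table(raw, "categories1", "cat1")
cat2 = membership_table(raw, "categories2", "cat2")

def pivot_mean(data, memberships, index, columns, values="money"):
    """Merge only the needed membership tables, then average `values`."""
    merged = data
    for table in memberships:
        merged = merged.merge(table, on="ID", how="inner")
    return merged.pivot_table(values=values, index=index, columns=columns, aggfunc="mean")

by_year = pivot_mean(data, [cat1], index="cat1", columns="year")
by_cat2 = pivot_mean(data, [cat1, cat2], index="cat1", columns="cat2")
print(by_year)
print(by_cat2)
```

Under this layout, a new category system would be one more membership table, and a new label such as "d" would be just another row value rather than a new column or index level.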