
Pandas Generating dataframe based on columns being present

I have a dataframe

import pandas as pd

df=pd.DataFrame({'Players': [ 'Sam', 'Greg', 'Steve', 'Sam',
                 'Greg', 'Steve', 'Greg', 'Steve', 'Greg', 'Steve'],
                 'Wins': [10,5,5,20,30,20,6,9,3,10],
                 'Losses': [5,5,5,2,3,2,16,20,3,12],
                 'Type': ['A','B','B','B','A','B','B','A','A','B'],
                 })

If I want to summarize it, I can manually make another dataframe:

p=df.groupby('Players')

summary = pd.DataFrame({'Total Games': p.Players.count(),
                        'Average Wins':p.Wins.mean(),
                        'Greatest Wins':p.Wins.max(),
                        'Unique games':p.Type.nunique()})

Let's say I want to automate this summarizing process to create a dataframe: if column X is present, perform summarization Y. What would be the best way of doing this? I have attempted using a dictionary, but I think I did something wrong:

p=df.groupby('Players')
sumdict = {'Total Games': ['Players', p.Players.count()],
            'Average Wins':['Wins', p.Wins.mean()],
            'Greatest Wins':['Wins', p.Wins.max()],
            'Unique games':['Type', p.Type.nunique()],
            'Max Score':['Score', p.Score.max()]}

summary=pd.DataFrame()

for key, value in sumdict.items():
        if value[0] in df.columns:
            data = pd.DataFrame({key: value[1],})
            summary=summary.append(data)
        else:
            continue

Pandas DataFrames support most of the dict methods, including get (which lets you substitute a default value when a key is missing). So you can compute the statistics you want on all columns, get the values for the columns you care about while substituting an empty Series for missing columns, and then drop the all-NaN columns (I use 'Max Score', built from the nonexistent 'Score' column, to demonstrate what happens to missing columns).
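
As a quick sketch of the get fallback on its own, reusing the df from the question (the dtype=float is just my choice so the empty fallback stays numeric):

empty = pd.Series(dtype=float)
df.get('Wins', empty)    # 'Wins' exists, so the real column comes back
df.get('Score', empty)   # 'Score' does not exist, so the empty fallback comes back

The same idea applied to the grouped statistics: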

eser = pd.Series()                        # empty fallback for columns that don't exist
count = p.count().max(axis=1)             # non-null counts per column, reduced to one total per player
all_max = p.max()                         # per-player max of every column
score_max = all_max.get('Score', eser)    # 'Score' is missing from df, so this is the empty fallback
wins_max = all_max.get('Wins', eser)
wins_mean = p.mean().get('Wins', eser)    # on pandas 2.x you may need p.mean(numeric_only=True) because of the string 'Type' column
type_nunique = p.agg(lambda x: x.nunique()).get('Type', eser)   # unique values per column, then pick 'Type'

summary = pd.DataFrame({'Total Games': count,
                        'Average Wins': wins_mean,
                        'Greatest Wins': wins_max,
                        'Unique games': type_nunique,
                        'Max Score': score_max})
summary.dropna(axis=1, how='all', inplace=True)

Or as a one-liner (this calculates the max on all columns twice, which shouldn't be a problem for a small number of values):

summary = pd.DataFrame({'Total Games': p.count().max(axis=1),
                        'Average Wins': p.mean().get('Wins', pd.Series()),
                        'Greatest Wins': p.max().get('Wins', pd.Series()),
                        'Unique games': p.agg(lambda x: x.nunique()).get('Type', pd.Series()),
                        'Max Score': p.max().get('Score', pd.Series())}).dropna(axis=1, how='all')

The result of either approach:

       Average Wins  Greatest Wins  Total Games  Unique games
Greg             11             30            4             2
Sam              15             20            2             2
Steve            11             20            4             2

Without the dropna:

       Average Wins  Greatest Wins  Max Score  Total Games  Unique games
Greg             11             30        NaN            4             2
Sam              15             20        NaN            2             2
Steve            11             20        NaN            4             2
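
As a side note (my own addition, not part of the original answer): the lambda used for the unique count can also be written with the groupby's built-in nunique method, which computes the same per-column unique counts:

# equivalent to p.agg(lambda x: x.nunique()) for this purpose
unique_games = p.nunique().get('Type', pd.Series(dtype=float))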

If performance is an issue, the approaches above will be slow because they compute multiple statistics on all columns, which means some statistics are computed and then thrown away. A faster, but uglier, approach is similar to yours: loop over a dict.

The problem with your implementation is that the dict values are not lazily evaluated; they are evaluated as soon as the dict literal is created, which means it still tries to access the non-existent 'Score' column (via p.Score) regardless of the check in your loop.
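
You can see this directly with the df from the question, which has no 'Score' column:

p = df.groupby('Players')
p.Score.max()   # raises AttributeError before the dict (and your loop) is ever built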

The approach below looks up the column and applies the operation only when the column is actually present (with special handling for the count case, since any existing column will work for that):

sumdict = {'Total Games': (None, 'count'),
           'Average Wins': ('Wins', 'mean'),
           'Greatest Wins': ('Wins', 'max'),
           'Unique games': ('Type', 'nunique'),
           'Max Score': ('Score', 'max')}

summary = []
for key, (column, op) in sumdict.items():
    if column is None:
        # count case: any existing column will do, so count every column and take the max
        res = p.agg(op).max(axis=1)
    elif column not in df:
        # the column this statistic needs is missing, so skip it entirely
        continue
    else:
        res = p[column].agg(lambda x: getattr(x, op)())
    summary.append(pd.DataFrame({key: res}))
summary = pd.concat(summary, axis=1)

It gives the same result as my above approaches, although with a different column order.
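
If a specific column order matters, one option (my own sketch, not part of the original answer) is to reorder the result using the keys of sumdict, keeping only the statistics that were actually computed:

# dict keys keep insertion order (Python 3.7+), so this restores that ordering
summary = summary[[key for key in sumdict if key in summary.columns]]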
