简体   繁体   中英

Combining groupby and apply in multiIndex DataFrames

I am working with a multiIndex DataFrame and want to do some groupby / apply() operations. I am struggling with how to combine groupby and apply.

I would like to extract the values of two indices of my DataFrame and compare those values in an apply function.

For those occurrences where the apply function is true, I would like to do a groupby / sum over the values of my DataFrame.

Is there a good way to do this without using for loops?


 # Index specifier
ix = pd.MultiIndex.from_product(
    [['2015', '2016', '2017', '2018'],
     ['2016', '2017', '2018', '2019', '2020'],
     ['A', 'B', 'C']],
    names=['SimulationStart', 'ProjectionPeriod', 'Group']
)

df = pd.DataFrame(np.random.randn(60,1), index= ix, columns=['Input'])

# Calculate sum over all projection periods for each simulation/group
all_periods = df.groupby(level=['SimulationStart', 'Group']).sum()

# This part of the code is not working yet
# is there a way to extract data from the indices of the DataFrame?
# Calculate sum over all projection periods for each simulation/group;
# where projection period is a maximum of one year in the future
one_year_ahead = df.groupby(level=['SimulationStart', 'Group']) \
                   .apply(lambda x: x['ProjectionPeriod'] - \
                                    x['SimulationStart'] <= 1).sum()

You could calculate the difference, ProjectionPeriod - SimulationStart , before performing the groupby/sum operation.

get_values = df.index.get_level_values
mask = (get_values('ProjectionPeriod') - get_values('SimulationStart')) <= 1
one_year_ahead = df.loc[mask].groupby(level=['SimulationStart', 'Group']).sum()

import numpy as np
import pandas as pd
ix = pd.MultiIndex.from_product(
    [[2015, 2016, 2017, 2018], 
     [2016, 2017, 2018, 2019, 2020], ['A', 'B', 'C']],
    names=['SimulationStart', 'ProjectionPeriod', 'Group'])
df = pd.DataFrame(np.random.randn(60,1), index= ix, columns=['Input'])

get_values = df.index.get_level_values
mask = (get_values('ProjectionPeriod') - get_values('SimulationStart')) <= 1
one_year_ahead = df.loc[mask].groupby(level=['SimulationStart', 'Group']).sum()
print(one_year_ahead)

yields

                          Input
SimulationStart Group          
2015            A      0.821851
                B     -0.643342
                C     -0.140112
2016            A      0.384885
                B     -0.252186
                C     -1.057493
2017            A     -1.055933
                B      1.096221
                C     -4.150002
2018            A      0.584859
                B     -4.062078
                C      1.225105

Here is one way to do it.

df.reset_index().query('ProjectionPeriod - SimulationStart == 1') \
    .groupby(['SimulationStart', 'Group']).Input.sum()

SimulationStart  Group
2015             A        1.100246
                 B       -0.605710
                 C        1.366465
2016             A        0.359406
                 B       -2.077444
                 C       -0.004356
2017             A        0.604497
                 B       -0.362941
                 C        0.103945
2018             A       -0.861976
                 B       -0.737274
                 C        0.237512
Name: Input, dtype: float64

Because you have unique values in the Group column, this also works but I don't believe its what you want.

df.reset_index().query('ProjectionPeriod - SimulationStart == 1') \
    [['SimulationStart', 'Group', 'Input']]

在此输入图像描述

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM