简体   繁体   中英

Fill missing combinations with "ones" in a groupby object

I have data in the following format.

date        group   ret
1986-01-31  1       1.3
1986-01-31  1       0.9
1986-01-31  2       1.4
1986-01-31  2       1.6
1986-01-31  2       1.5
1986-01-31  3       1.1
1986-02-28  2       1.3
1986-02-28  2       1.1

I want to get the average return per date and group, which I get by doing:

output = df.groupby(['date', 'group'])['ret'].mean() + 1 
output = output.reset_index()

Which gives the following output:

date        group   ret
1986-01-31  1       1.1
1986-01-31  2       1.5
1986-01-31  3       1.1
1986-02-28  2       1.2

However, since no "ret" was given at date 1986-02-28 for class 1 and 3, there is no row in the output for class 1 and 3 on this date. What I would like is that for any combination of dates and classes where no return is given in the original dataframe, this combination gets and output of "1" in the output. So, the required output is:

date        group   ret
1986-01-31  1       1.1
1986-01-31  2       1.5
1986-01-31  3       1.1
1986-02-28  1       1
1986-02-28  2       1.2
1986-02-28  3       1

What would be a good solution for this problem? Thanks in advance!

We can do pivot_table then stack

out = df.pivot_table(index='date',columns='group',values='ret',aggfunc = 'mean').fillna(1).stack().reset_index(name='value')
         date  group  value
0  1986-01-31      1    1.1
1  1986-01-31      2    1.5
2  1986-01-31      3    1.1
3  1986-02-28      1    1.0
4  1986-02-28      2    1.2
5  1986-02-28      3    1.0

You can reindex the result of the groupby and mean and fill the null values with ones:

output = df.groupby(['date', 'group'])['ret'].mean().reindex(
    pd.MultiIndex.from_product(
        (pd.date_range(df.date.min(), df.date.max(), freq='M'),
         sorted(df.group.unique())),
        names=['date', 'group'],
    )
).fillna(1).reset_index()

Here the result for the DataFrame in your question:

        date  group  ret
0 1986-01-31      1  1.1
1 1986-01-31      2  1.5
2 1986-01-31      3  1.1
3 1986-02-28      1  1.0
4 1986-02-28      2  1.2
5 1986-02-28      3  1.0

You could use the complete function from pyjanitor to expose the explicitly missing values, and fillna with 1 :

# pip install pyjanitor
import janitor
(df.groupby(['date', 'group'], as_index = False)
   .ret
   .mean()
   .complete(['date', 'group'])
   .fillna(1)
 )

         date  group  ret
0  1986-01-31      1  1.1
1  1986-01-31      2  1.5
2  1986-01-31      3  1.1
3  1986-02-28      1  1.0
4  1986-02-28      2  1.2
5  1986-02-28      3  1.0

Alternatively, you could convert the group column to a categorical dtype , all categories will be maintained during the groupby:

from pandas.api.types import CategoricalDtype
(df
 .astype({"group": CategoricalDtype(categories=df.group.unique())})
 .groupby(['date', 'group'], as_index = False)
 .ret
 .mean()
 .fillna(1)
 )

         date group  ret
0  1986-01-31     1  1.1
1  1986-01-31     2  1.5
2  1986-01-31     3  1.1
3  1986-02-28     1  1.0
4  1986-02-28     2  1.2
5  1986-02-28     3  1.0

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM