
Pandas groupby then fill missing rows

I have a dataframe structured like this:

df_all:

      day_time            LCLid       energy(kWh/hh)
2014-02-08 23:00:00     MAC000006         0.077
2014-02-08 23:30:00     MAC000006         0.079
        ...
2014-02-08 23:00:00     MAC000007         0.045
        ...

There are four consecutive datetimes (across all LCLids) missing from the data, and I want to fill them with the preceding and following values.

If the dataframe were split into sub-dataframes (df), one per LCLid, for example:

gb = df_all.groupby('LCLid')
df_list = [gb.get_group(x) for x in gb.groups]

Then I could do this for each df in df_list:

#valid data before gap
prev_row = df.loc['2013-09-09 22:30:00'].copy()
#valid data after gap
post_row = df.loc['2013-09-10 01:00:00'].copy()
df.loc[pd.to_datetime('2013-09-09 23:00:00')] = prev_row
df.loc[pd.to_datetime('2013-09-09 23:30:00')] = prev_row
df.loc[pd.to_datetime('2013-09-10 00:00:00')] = post_row
df.loc[pd.to_datetime('2013-09-10 00:30:00')] = post_row

df = df.sort_index()

How can I do this on df_all in one go, filling the missing data with 'valid' data taken only from the same LCLid?

The solution

The input DataFrame:

                         LCLid  energy(kWh/hh)
day_time                                      
2014-01-01 00:00:00  MAC000006        0.270453
2014-01-01 00:00:00  MAC000007        0.170603
2014-01-01 00:30:00  MAC000006        0.716418
2014-01-01 00:30:00  MAC000007        0.276678
2014-01-01 03:00:00  MAC000006        0.819146
2014-01-01 03:00:00  MAC000007        0.027490
2014-01-01 03:30:00  MAC000006        0.688879
2014-01-01 03:30:00  MAC000007        0.868017

What you need to do:

full_idx = pd.date_range(start=df.index.min(), end=df.index.max(), freq='30min')
df = (
    df
    .groupby('LCLid', as_index=False)
    .apply(lambda group: group.reindex(full_idx, method='nearest'))
    .reset_index(level=0, drop=True)
    .sort_index()
)

Result:

                         LCLid  energy(kWh/hh)
2014-01-01 00:00:00  MAC000006        0.270453
2014-01-01 00:00:00  MAC000007        0.170603
2014-01-01 00:30:00  MAC000006        0.716418
2014-01-01 00:30:00  MAC000007        0.276678
2014-01-01 01:00:00  MAC000006        0.716418
2014-01-01 01:00:00  MAC000007        0.276678
2014-01-01 01:30:00  MAC000006        0.716418
2014-01-01 01:30:00  MAC000007        0.276678
2014-01-01 02:00:00  MAC000006        0.819146
2014-01-01 02:00:00  MAC000007        0.027490
2014-01-01 02:30:00  MAC000006        0.819146
2014-01-01 02:30:00  MAC000007        0.027490
2014-01-01 03:00:00  MAC000006        0.819146
2014-01-01 03:00:00  MAC000007        0.027490
2014-01-01 03:30:00  MAC000006        0.688879
2014-01-01 03:30:00  MAC000007        0.868017

The explanation

First I'll build an example DataFrame that looks like yours:

import numpy as np
import pandas as pd


# Build an example DataFrame that looks like yours
df = pd.DataFrame(
    {
        'day_time': [
            pd.Timestamp(2014, 1, 1, 0, 0),
            pd.Timestamp(2014, 1, 1, 0, 0),
            pd.Timestamp(2014, 1, 1, 0, 30),
            pd.Timestamp(2014, 1, 1, 0, 30),
            pd.Timestamp(2014, 1, 1, 3, 0),
            pd.Timestamp(2014, 1, 1, 3, 0),
            pd.Timestamp(2014, 1, 1, 3, 30),
            pd.Timestamp(2014, 1, 1, 3, 30),
        ],
        'LCLid': [
            'MAC000006',
            'MAC000007',
            'MAC000006',
            'MAC000007',
            'MAC000006',
            'MAC000007',
            'MAC000006',
            'MAC000007',
        ],
        # Random values, so your numbers will differ from the output shown
        'energy(kWh/hh)': np.random.rand(8),
    },
).set_index('day_time')

Result:

                         LCLid  energy(kWh/hh)
day_time
2014-01-01 00:00:00  MAC000006        0.270453
2014-01-01 00:00:00  MAC000007        0.170603
2014-01-01 00:30:00  MAC000006        0.716418
2014-01-01 00:30:00  MAC000007        0.276678
2014-01-01 03:00:00  MAC000006        0.819146
2014-01-01 03:00:00  MAC000007        0.027490
2014-01-01 03:30:00  MAC000006        0.688879
2014-01-01 03:30:00  MAC000007        0.868017

Notice how we're missing the following timestamps:

2014-01-01 01:00:00
2014-01-01 01:30:00
2014-01-01 02:00:00
2014-01-01 02:30:00
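
Rather than spotting the gaps by eye, they can also be listed programmatically. The following is a minimal sketch, assuming a small hypothetical frame (`sample`, with made-up energy values) that has the same timestamps as one meter in the example:

```python
import pandas as pd

# Hypothetical sample mirroring one meter: readings at 00:00, 00:30, 03:00, 03:30
times = pd.to_datetime([
    '2014-01-01 00:00', '2014-01-01 00:30',
    '2014-01-01 03:00', '2014-01-01 03:30',
])
sample = pd.DataFrame(
    {'LCLid': 'MAC000006', 'energy(kWh/hh)': [0.1, 0.2, 0.3, 0.4]},
    index=times,
)

# Every half-hour slot between the first and last reading
expected = pd.date_range(sample.index.min(), sample.index.max(), freq='30min')

# Slots in the full range that never appear in the data
missing = expected.difference(sample.index)
print(missing)
```

`Index.difference` returns the labels present in `expected` but absent from the data, which here are the four half-hour slots between 00:30 and 03:00.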

df.reindex()

The first thing to know is that df.reindex() lets you add missing index values; the rows for newly added labels are filled with NaN by default. In your case, you want to supply the full timestamp range as the new index, including the values that don't show up in your starting DataFrame.

Here I used pd.date_range() to list all timestamps between your minimum and maximum starting index values, in steps of 30 minutes. WARNING: this means that if timestamps are missing at the very beginning or the very end of your data, they will not be added back, so you may want to specify start and end explicitly.

full_idx = pd.date_range(start=df.index.min(), end=df.index.max(), freq='30min')

Result:

DatetimeIndex(['2014-01-01 00:00:00', '2014-01-01 00:30:00',
               '2014-01-01 01:00:00', '2014-01-01 01:30:00',
               '2014-01-01 02:00:00', '2014-01-01 02:30:00',
               '2014-01-01 03:00:00', '2014-01-01 03:30:00'],
              dtype='datetime64[ns]', freq='30T')
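
If the gaps might sit at the edges of the data, the range can be built from explicit bounds instead of the observed minimum and maximum. A small sketch, where the `start` and `end` values are assumptions chosen purely for illustration (one full day of half-hourly readings):

```python
import pandas as pd

# Assumed bounds for illustration: one full day of half-hourly readings
start = pd.Timestamp('2014-01-01 00:00')
end = pd.Timestamp('2014-01-01 23:30')

# Both endpoints are included by default
full_day_idx = pd.date_range(start=start, end=end, freq='30min')
print(len(full_day_idx))  # 48 half-hour slots in a day
```

Reindexing against `full_day_idx` would then also create rows for slots before the first or after the last recorded reading.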

Now if we use that to reindex one of your grouped sub-DataFrames, we would get this:

grouped_df = df[df.LCLid == 'MAC000006']
grouped_df.reindex(full_idx)

Result:

                         LCLid  energy(kWh/hh)
2014-01-01 00:00:00  MAC000006        0.270453
2014-01-01 00:30:00  MAC000006        0.716418
2014-01-01 01:00:00        NaN             NaN
2014-01-01 01:30:00        NaN             NaN
2014-01-01 02:00:00        NaN             NaN
2014-01-01 02:30:00        NaN             NaN
2014-01-01 03:00:00  MAC000006        0.819146
2014-01-01 03:30:00  MAC000006        0.688879

You said you want to fill missing values using the closest available surrounding value. This can be done during reindexing, as follows:

grouped_df.reindex(full_idx, method='nearest')

Result:

                         LCLid  energy(kWh/hh)
2014-01-01 00:00:00  MAC000006        0.270453
2014-01-01 00:30:00  MAC000006        0.716418
2014-01-01 01:00:00  MAC000006        0.716418
2014-01-01 01:30:00  MAC000006        0.716418
2014-01-01 02:00:00  MAC000006        0.819146
2014-01-01 02:30:00  MAC000006        0.819146
2014-01-01 03:00:00  MAC000006        0.819146
2014-01-01 03:30:00  MAC000006        0.688879
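
method='nearest' decides per timestamp which neighbour is closer. If you would rather state explicitly how many slots are copied forward from the previous reading and leave the rest to the following one, a possible alternative is to reindex without a method and chain ffill() and bfill(). This is only a sketch: the limit of 2 is an assumption that matches a four-slot gap, and the energy values are made up.

```python
import pandas as pd

# Hypothetical single-meter data with a four-slot gap between 00:30 and 03:00
times = pd.to_datetime([
    '2014-01-01 00:00', '2014-01-01 00:30',
    '2014-01-01 03:00', '2014-01-01 03:30',
])
one_meter = pd.DataFrame(
    {'LCLid': 'MAC000006', 'energy(kWh/hh)': [0.1, 0.2, 0.3, 0.4]},
    index=times,
)
full_idx = pd.date_range(times.min(), times.max(), freq='30min')

filled = (
    one_meter
    .reindex(full_idx)  # missing slots become NaN rows
    .ffill(limit=2)     # copy the last valid row forward, at most 2 slots
    .bfill()            # fill what is left from the next valid row
)
print(filled)
```

With this data the first two missing slots take the 00:30 reading and the last two take the 03:00 reading, the same split the question describes with prev_row and post_row.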

Doing all the groups at once using df.groupby()

Now we'd like to apply this transformation to every group in your DataFrame, where a group is defined by its LCLid.

(
    df
    .groupby('LCLid', as_index=False)  # use LCLid as groupby key, but don't add it as a group index
    .apply(lambda group: group.reindex(full_idx, method='nearest'))  # do this for each group
    .reset_index(level=0, drop=True)  # get rid of the automatic index generated during groupby
    .sort_index()  # This is optional, just in case you want timestamps in chronological order
)

Result:

                         LCLid  energy(kWh/hh)
2014-01-01 00:00:00  MAC000006        0.270453
2014-01-01 00:00:00  MAC000007        0.170603
2014-01-01 00:30:00  MAC000006        0.716418
2014-01-01 00:30:00  MAC000007        0.276678
2014-01-01 01:00:00  MAC000006        0.716418
2014-01-01 01:00:00  MAC000007        0.276678
2014-01-01 01:30:00  MAC000006        0.716418
2014-01-01 01:30:00  MAC000007        0.276678
2014-01-01 02:00:00  MAC000006        0.819146
2014-01-01 02:00:00  MAC000007        0.027490
2014-01-01 02:30:00  MAC000006        0.819146
2014-01-01 02:30:00  MAC000007        0.027490
2014-01-01 03:00:00  MAC000006        0.819146
2014-01-01 03:00:00  MAC000007        0.027490
2014-01-01 03:30:00  MAC000006        0.688879
2014-01-01 03:30:00  MAC000007        0.868017
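
Recent pandas releases have been changing how groupby.apply treats the grouping column, so it can be worth having a variant that does not depend on that behaviour, plus a quick check that every meter ends up with a complete set of timestamps. The following is a hedged sketch rather than the method above: it iterates over the groups and concatenates the reindexed pieces, and the `sample` frame with its fixed energy values is hypothetical.

```python
import pandas as pd

# Hypothetical fixed values standing in for the random sample data
times = pd.to_datetime([
    '2014-01-01 00:00', '2014-01-01 00:30',
    '2014-01-01 03:00', '2014-01-01 03:30',
])
readings = {
    'MAC000006': [0.1, 0.2, 0.3, 0.4],
    'MAC000007': [0.5, 0.6, 0.7, 0.8],
}
sample = pd.concat([
    pd.DataFrame({'LCLid': meter, 'energy(kWh/hh)': values}, index=times)
    for meter, values in readings.items()
]).sort_index(kind='mergesort')

full_idx = pd.date_range(sample.index.min(), sample.index.max(), freq='30min')

# Reindex each meter separately; iterating keeps the LCLid column in each group
filled = pd.concat([
    group.reindex(full_idx, method='nearest')
    for _, group in sample.groupby('LCLid')
]).sort_index(kind='mergesort')

# Every meter should now have exactly one row per half-hour slot
rows_per_meter = filled.groupby('LCLid').size()
assert (rows_per_meter == len(full_idx)).all()
print(rows_per_meter)
```

The stable sort (kind='mergesort') keeps the meters in a consistent order within each timestamp, which makes the output easier to compare with the table above.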

Relevant doc:

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.date_range.html
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reindex.html
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.GroupBy.apply.html
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reset_index.html
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_index.html
