Pandas groupby then fill missing rows

I have a dataframe structured like this:

df_all:

      day_time            LCLid       energy(kWh/hh)
2014-02-08 23:00:00     MAC000006         0.077
2014-02-08 23:30:00     MAC000006         0.079
        ...
2014-02-08 23:00:00     MAC000007         0.045
        ...

There are four sequential datetimes (across all LCLids) missing from the data, which I want to fill with the preceding and following values.

If the dataframe were split into sub-dataframes (df), one per LCLid, e.g.:

gb = df_all.groupby('LCLid')
df_list = [gb.get_group(x) for x in gb.groups]

Then I could do this for each df in df_list:

#valid data before gap
prev_row = df.loc['2013-09-09 22:30:00'].copy()
#valid data after gap
post_row = df.loc['2013-09-10 01:00:00'].copy()
df.loc[pd.to_datetime('2013-09-09 23:00:00')] = prev_row
df.loc[pd.to_datetime('2013-09-09 23:30:00')] = prev_row
df.loc[pd.to_datetime('2013-09-10 00:00:00')] = post_row
df.loc[pd.to_datetime('2013-09-10 00:30:00')] = post_row

df = df.sort_index()

How can I do this on df_all in one go, filling the missing data with 'valid' data taken only from the same LCLid?

The solution

The input DataFrame:

                         LCLid  energy(kWh/hh)
day_time                                      
2014-01-01 00:00:00  MAC000006        0.270453
2014-01-01 00:00:00  MAC000007        0.170603
2014-01-01 00:30:00  MAC000006        0.716418
2014-01-01 00:30:00  MAC000007        0.276678
2014-01-01 03:00:00  MAC000006        0.819146
2014-01-01 03:00:00  MAC000007        0.027490
2014-01-01 03:30:00  MAC000006        0.688879
2014-01-01 03:30:00  MAC000007        0.868017

What you need to do:

full_idx = pd.date_range(start=df.index.min(), end=df.index.max(), freq='30min')
df = (
    df
    .groupby('LCLid', as_index=False)  
    .apply(lambda group: group.reindex(full_idx, method='nearest'))  
    .reset_index(level=0, drop=True)  
    .sort_index()  
)

Result:

                         LCLid  energy(kWh/hh)
2014-01-01 00:00:00  MAC000006        0.270453
2014-01-01 00:00:00  MAC000007        0.170603
2014-01-01 00:30:00  MAC000006        0.716418
2014-01-01 00:30:00  MAC000007        0.276678
2014-01-01 01:00:00  MAC000006        0.716418
2014-01-01 01:00:00  MAC000007        0.276678
2014-01-01 01:30:00  MAC000006        0.716418
2014-01-01 01:30:00  MAC000007        0.276678
2014-01-01 02:00:00  MAC000006        0.819146
2014-01-01 02:00:00  MAC000007        0.027490
2014-01-01 02:30:00  MAC000006        0.819146
2014-01-01 02:30:00  MAC000007        0.027490
2014-01-01 03:00:00  MAC000006        0.819146
2014-01-01 03:00:00  MAC000007        0.027490
2014-01-01 03:30:00  MAC000006        0.688879
2014-01-01 03:30:00  MAC000007        0.868017

The explanation

First I'll build an example DataFrame that looks like yours:

import numpy as np
import pandas as pd


# Building an example DataFrame that looks like yours
df = pd.DataFrame(
    {
        'day_time': [
            pd.Timestamp(2014, 1, 1, 0, 0),
            pd.Timestamp(2014, 1, 1, 0, 0),
            pd.Timestamp(2014, 1, 1, 0, 30),
            pd.Timestamp(2014, 1, 1, 0, 30),
            pd.Timestamp(2014, 1, 1, 3, 0),
            pd.Timestamp(2014, 1, 1, 3, 0),
            pd.Timestamp(2014, 1, 1, 3, 30),
            pd.Timestamp(2014, 1, 1, 3, 30),
        ],
        'LCLid': [
            'MAC000006',
            'MAC000007',
            'MAC000006',
            'MAC000007',
            'MAC000006',
            'MAC000007',
            'MAC000006',
            'MAC000007',
        ],
        'energy(kWh/hh)': np.random.rand(8),
    },
).set_index('day_time')

Result:

                         LCLid  energy(kWh/hh)
day_time
2014-01-01 00:00:00  MAC000006        0.270453
2014-01-01 00:00:00  MAC000007        0.170603
2014-01-01 00:30:00  MAC000006        0.716418
2014-01-01 00:30:00  MAC000007        0.276678
2014-01-01 03:00:00  MAC000006        0.819146
2014-01-01 03:00:00  MAC000007        0.027490
2014-01-01 03:30:00  MAC000006        0.688879
2014-01-01 03:30:00  MAC000007        0.868017

Notice how we're missing the following timestamps:

2014-01-01 01:00:00
2014-01-01 01:30:00
2014-01-01 02:00:00
2014-01-01 02:30:00
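
The gaps can also be listed programmatically. The following is a minimal sketch, assuming a half-hourly grid between the first and last observed timestamps; the names observed, expected and missing are illustrative and do not come from the original answer.

```python
import pandas as pd

# Timestamps that are actually present in the example data (one copy each).
observed = pd.DatetimeIndex([
    "2014-01-01 00:00", "2014-01-01 00:30",
    "2014-01-01 03:00", "2014-01-01 03:30",
])

# The grid we would expect if no half-hour slot were missing.
expected = pd.date_range(observed.min(), observed.max(), freq="30min")

# Set difference: slots on the expected grid that were never observed.
missing = expected.difference(observed)
print(missing)  # four timestamps, 01:00 through 02:30 on 2014-01-01
```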

df.reindex()

The first thing to know is that df.reindex() conforms a DataFrame to a new index: rows for index values that were not present before are added, and they are filled with NaN by default. In your case, you want to supply the full timestamp range as the new index, including the values that don't show up in your starting DataFrame.

Here I used pd.date_range() to list all timestamps between the minimum and maximum of your starting index, in steps of 30 minutes. WARNING: with this approach, timestamps that are missing at the very beginning or end of the data are not added back! So you may want to specify start and end explicitly.

full_idx = pd.date_range(start=df.index.min(), end=df.index.max(), freq='30min')

Result:

DatetimeIndex(['2014-01-01 00:00:00', '2014-01-01 00:30:00',
               '2014-01-01 01:00:00', '2014-01-01 01:30:00',
               '2014-01-01 02:00:00', '2014-01-01 02:30:00',
               '2014-01-01 03:00:00', '2014-01-01 03:30:00'],
              dtype='datetime64[ns]', freq='30min')
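
To cover gaps at the edges of the data, as the warning above suggests, the bounds can be given explicitly instead of being derived from the index. This is only a sketch: expected_start and expected_end are hypothetical values standing in for whatever period you actually expect readings for.

```python
import pandas as pd

# Hypothetical bounds: replace with the period your data is supposed to cover.
expected_start = pd.Timestamp("2014-01-01 00:00")
expected_end = pd.Timestamp("2014-01-01 23:30")

# Both endpoints are included, so one full day yields 48 half-hour slots.
full_idx = pd.date_range(start=expected_start, end=expected_end, freq="30min")
print(len(full_idx))  # 48
```

Reindexing against an index built this way also creates rows before the first and after the last observed reading.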

Now if we use that to reindex one of your grouped sub-DataFrames, we would get this:

grouped_df = df[df.LCLid == 'MAC000006']
grouped_df.reindex(full_idx)

Result:

                         LCLid  energy(kWh/hh)
2014-01-01 00:00:00  MAC000006        0.270453
2014-01-01 00:30:00  MAC000006        0.716418
2014-01-01 01:00:00        NaN             NaN
2014-01-01 01:30:00        NaN             NaN
2014-01-01 02:00:00        NaN             NaN
2014-01-01 02:30:00        NaN             NaN
2014-01-01 03:00:00  MAC000006        0.819146
2014-01-01 03:30:00  MAC000006        0.688879
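
The reindex is applied to one LCLid at a time because the combined frame repeats every timestamp once per LCLid, and reindex does not accept an index with duplicate labels. The sketch below illustrates this on made-up numbers; it assumes the pandas behaviour of raising a ValueError in that case, and the names reindex_failed and one_meter are illustrative.

```python
import pandas as pd

# Small made-up frame: two meters share each timestamp, and 01:00 is missing.
times = pd.to_datetime(["2014-01-01 00:00", "2014-01-01 00:30", "2014-01-01 01:30"])
df = pd.DataFrame(
    {
        "LCLid": ["MAC000006", "MAC000007"] * 3,
        "energy(kWh/hh)": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    },
    index=times.repeat(2),
)
full_idx = pd.date_range(df.index.min(), df.index.max(), freq="30min")

# Reindexing the combined frame fails because its index has duplicates.
reindex_failed = False
try:
    df.reindex(full_idx)
except ValueError as err:
    reindex_failed = True
    print("reindex on the combined frame failed:", err)

# A single meter has a unique index, so the same call works.
one_meter = df[df["LCLid"] == "MAC000006"].reindex(full_idx)
print(one_meter)
```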

You said you want to fill missing values using the closest available surrounding value. This can be done during reindexing, as follows:

grouped_df.reindex(full_idx, method='nearest')

Result:

                         LCLid  energy(kWh/hh)
2014-01-01 00:00:00  MAC000006        0.270453
2014-01-01 00:30:00  MAC000006        0.716418
2014-01-01 01:00:00  MAC000006        0.716418
2014-01-01 01:30:00  MAC000006        0.716418
2014-01-01 02:00:00  MAC000006        0.819146
2014-01-01 02:30:00  MAC000006        0.819146
2014-01-01 03:00:00  MAC000006        0.819146
2014-01-01 03:30:00  MAC000006        0.688879
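
For comparison, here is a small sketch of how method='nearest' differs from method='ffill', and how the tolerance argument of reindex can stop values from being copied across a long gap. The series and its values are made up for illustration.

```python
import pandas as pd

# Two readings with a four-slot gap between them.
idx = pd.to_datetime(["2014-01-01 00:30", "2014-01-01 03:00"])
s = pd.Series([1.0, 2.0], index=idx)
full_idx = pd.date_range(idx.min(), idx.max(), freq="30min")

# 'nearest' copies from whichever neighbour is closer in time.
nearest = s.reindex(full_idx, method="nearest")

# 'ffill' always copies the last reading seen before the gap.
ffilled = s.reindex(full_idx, method="ffill")

# With a tolerance, slots further than 30 minutes from any reading stay NaN.
capped = s.reindex(full_idx, method="nearest", tolerance=pd.Timedelta("30min"))

print(nearest.tolist())  # [1.0, 1.0, 1.0, 2.0, 2.0, 2.0]
print(ffilled.tolist())  # [1.0, 1.0, 1.0, 1.0, 1.0, 2.0]
print(capped.isna().sum())  # 2
```

With 'nearest', the first half of the gap takes the value before it and the second half takes the value after it, which matches what the question asks for.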

Doing all the groups at once using df.groupby()

Now we'd like to apply this transformation to every group in your DataFrame, where a group is defined by its LCLid.

(
    df
    .groupby('LCLid', as_index=False)  # use LCLid as groupby key, but don't add it as a group index
    .apply(lambda group: group.reindex(full_idx, method='nearest'))  # do this for each group
    .reset_index(level=0, drop=True)  # get rid of the automatic index generated during groupby
    .sort_index()  # This is optional, just in case you want timestamps in chronological order
)

Result:

                         LCLid  energy(kWh/hh)
2014-01-01 00:00:00  MAC000006        0.270453
2014-01-01 00:00:00  MAC000007        0.170603
2014-01-01 00:30:00  MAC000006        0.716418
2014-01-01 00:30:00  MAC000007        0.276678
2014-01-01 01:00:00  MAC000006        0.716418
2014-01-01 01:00:00  MAC000007        0.276678
2014-01-01 01:30:00  MAC000006        0.716418
2014-01-01 01:30:00  MAC000007        0.276678
2014-01-01 02:00:00  MAC000006        0.819146
2014-01-01 02:00:00  MAC000007        0.027490
2014-01-01 02:30:00  MAC000006        0.819146
2014-01-01 02:30:00  MAC000007        0.027490
2014-01-01 03:00:00  MAC000006        0.819146
2014-01-01 03:00:00  MAC000007        0.027490
2014-01-01 03:30:00  MAC000006        0.688879
2014-01-01 03:30:00  MAC000007        0.868017
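
The same per-group result can also be obtained without GroupBy.apply, by reindexing each group in a loop and concatenating the pieces. This may be convenient because recent pandas releases warn when apply operates on the grouping column. It is a sketch of an alternative, not the approach used above; the energy values are the ones from the example output, hard-coded to keep the snippet reproducible.

```python
import pandas as pd

times = pd.to_datetime([
    "2014-01-01 00:00", "2014-01-01 00:30",
    "2014-01-01 03:00", "2014-01-01 03:30",
])
df = pd.DataFrame({
    "day_time": times.repeat(2),  # each timestamp appears once per meter
    "LCLid": ["MAC000006", "MAC000007"] * 4,
    "energy(kWh/hh)": [0.270453, 0.170603, 0.716418, 0.276678,
                       0.819146, 0.027490, 0.688879, 0.868017],
}).set_index("day_time")

full_idx = pd.date_range(df.index.min(), df.index.max(), freq="30min")

# Reindex every meter on its own so values never leak between LCLids.
pieces = [
    group.reindex(full_idx, method="nearest")
    for _, group in df.groupby("LCLid")
]

# A stable sort keeps MAC000006 ahead of MAC000007 within each timestamp.
filled = pd.concat(pieces).sort_index(kind="mergesort")
print(filled)
```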

Relevant docs:

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.date_range.html
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reindex.html
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.GroupBy.apply.html
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reset_index.html
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_index.html
