Pandas groupby then fill missing rows
I have a DataFrame structured like this:
df_all:
day_time LCLid energy(kWh/hh)
2014-02-08 23:00:00 MAC000006 0.077
2014-02-08 23:30:00 MAC000006 0.079
...
2014-02-08 23:00:00 MAC000007 0.045
...
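Four consecutive datetimes are missing from the data (across all LCLid values), and I want to fill them with the preceding and following values.
If the DataFrame is split into sub-DataFrames (df), one per LCLid, for example: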
gb = df.groupby('LCLid')
df_list = [gb.get_group(x) for x in gb.groups]
then I can do this for each df in df_list:
#valid data before gap
prev_row = df.loc['2013-09-09 22:30:00'].copy()
#valid data after gap
post_row = df.loc['2013-09-10 01:00:00'].copy()
df.loc[pd.to_datetime('2013-09-09 23:00:00')] = prev_row
df.loc[pd.to_datetime('2013-09-09 23:30:00')] = prev_row
df.loc[pd.to_datetime('2013-09-10 00:00:00')] = post_row
df.loc[pd.to_datetime('2013-09-10 00:30:00')] = post_row
df = df.sort_index()
How can I do this on df_all in one pass, filling the missing data with the "valid" data from each LCLid?
Input DataFrame:
LCLid energy(kWh/hh)
day_time
2014-01-01 00:00:00 MAC000006 0.270453
2014-01-01 00:00:00 MAC000007 0.170603
2014-01-01 00:30:00 MAC000006 0.716418
2014-01-01 00:30:00 MAC000007 0.276678
2014-01-01 03:00:00 MAC000006 0.819146
2014-01-01 03:00:00 MAC000007 0.027490
2014-01-01 03:30:00 MAC000006 0.688879
2014-01-01 03:30:00 MAC000007 0.868017
What you need to do:
full_idx = pd.date_range(start=df.index.min(), end=df.index.max(), freq='30T')
df = (
    df
    .groupby('LCLid', as_index=False)
    .apply(lambda group: group.reindex(full_idx, method='nearest'))
    .reset_index(level=0, drop=True)
    .sort_index()
)
Result:
LCLid energy(kWh/hh)
2014-01-01 00:00:00 MAC000006 0.270453
2014-01-01 00:00:00 MAC000007 0.170603
2014-01-01 00:30:00 MAC000006 0.716418
2014-01-01 00:30:00 MAC000007 0.276678
2014-01-01 01:00:00 MAC000006 0.716418
2014-01-01 01:00:00 MAC000007 0.276678
2014-01-01 01:30:00 MAC000006 0.716418
2014-01-01 01:30:00 MAC000007 0.276678
2014-01-01 02:00:00 MAC000006 0.819146
2014-01-01 02:00:00 MAC000007 0.027490
2014-01-01 02:30:00 MAC000006 0.819146
2014-01-01 02:30:00 MAC000007 0.027490
2014-01-01 03:00:00 MAC000006 0.819146
2014-01-01 03:00:00 MAC000007 0.027490
2014-01-01 03:30:00 MAC000006 0.688879
2014-01-01 03:30:00 MAC000007 0.868017
import numpy as np
import pandas as pd
# Building an example DataFrame that looks like yours
df = pd.DataFrame({
    'day_time': [
        pd.Timestamp(2014, 1, 1, 0, 0),
        pd.Timestamp(2014, 1, 1, 0, 0),
        pd.Timestamp(2014, 1, 1, 0, 30),
        pd.Timestamp(2014, 1, 1, 0, 30),
        pd.Timestamp(2014, 1, 1, 3, 0),
        pd.Timestamp(2014, 1, 1, 3, 0),
        pd.Timestamp(2014, 1, 1, 3, 30),
        pd.Timestamp(2014, 1, 1, 3, 30),
    ],
    'LCLid': [
        'MAC000006',
        'MAC000007',
        'MAC000006',
        'MAC000007',
        'MAC000006',
        'MAC000007',
        'MAC000006',
        'MAC000007',
    ],
    'energy(kWh/hh)': np.random.rand(8)
    },
).set_index('day_time')
Result:
LCLid energy(kWh/hh)
day_time
2014-01-01 00:00:00 MAC000006 0.270453
2014-01-01 00:00:00 MAC000007 0.170603
2014-01-01 00:30:00 MAC000006 0.716418
2014-01-01 00:30:00 MAC000007 0.276678
2014-01-01 03:00:00 MAC000006 0.819146
2014-01-01 03:00:00 MAC000007 0.027490
2014-01-01 03:30:00 MAC000006 0.688879
2014-01-01 03:30:00 MAC000007 0.868017
Notice that the following timestamps are missing:
2014-01-01 01:00:00
2014-01-01 01:30:00
2014-01-01 02:00:00
2014-01-01 02:30:00
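The first thing to know is that df.reindex() lets you fill in missing index values; by default, the newly added rows are filled with NaN. In your case, you want to provide a complete timestamp index, including the values that do not appear in your starting DataFrame.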
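Here I used pd.date_range() to list every timestamp between the minimum and maximum of the starting index, in 30-minute steps. Warning: this means that if the missing timestamps are at the very beginning or end of the range, they will not be added back! So you may want to specify start and end explicitly.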
full_idx = pd.date_range(start=df.index.min(), end=df.index.max(), freq='30T')
Result:
DatetimeIndex(['2014-01-01 00:00:00', '2014-01-01 00:30:00',
'2014-01-01 01:00:00', '2014-01-01 01:30:00',
'2014-01-01 02:00:00', '2014-01-01 02:30:00',
'2014-01-01 03:00:00', '2014-01-01 03:30:00'],
dtype='datetime64[ns]', freq='30T')
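As a sketch of the warning above, the index can be built from explicit bounds instead of the observed minimum and maximum. The names window_start, window_end and readings, and the window itself, are assumptions made up for this illustration. The '30min' alias is used as an equivalent spelling of '30T'.

```python
import pandas as pd

# Hypothetical reporting window (an assumption for this sketch, not from the question).
window_start = pd.Timestamp("2014-01-01 00:00:00")
window_end = pd.Timestamp("2014-01-01 04:30:00")

# '30min' is the same half-hourly step as the '30T' alias.
explicit_idx = pd.date_range(start=window_start, end=window_end, freq="30min")

# One meter whose readings stop at 03:30, i.e. before the window ends.
readings = pd.DataFrame(
    {"LCLid": "MAC000006", "energy(kWh/hh)": [0.27, 0.72, 0.82, 0.69]},
    index=pd.to_datetime(
        ["2014-01-01 00:00", "2014-01-01 00:30", "2014-01-01 03:00", "2014-01-01 03:30"]
    ),
)

# Timestamps after the last reading (04:00 and 04:30) are now part of the
# index, so they get filled from the nearest available row as well.
filled = readings.reindex(explicit_idx, method="nearest")
print(filled)
```

With bounds taken from df.index.min() and df.index.max(), the rows for 04:00 and 04:30 would not exist at all.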
Now, if we use it to reindex one of your grouped sub-DataFrames, we get:
grouped_df = df[df.LCLid == 'MAC000006']
grouped_df.reindex(full_idx)
Result:
LCLid energy(kWh/hh)
2014-01-01 00:00:00 MAC000006 0.270453
2014-01-01 00:30:00 MAC000006 0.716418
2014-01-01 01:00:00 NaN NaN
2014-01-01 01:30:00 NaN NaN
2014-01-01 02:00:00 NaN NaN
2014-01-01 02:30:00 NaN NaN
2014-01-01 03:00:00 MAC000006 0.819146
2014-01-01 03:30:00 MAC000006 0.688879
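The NaN rows above are exactly the timestamps that are in full_idx but not in the group's own index. A minimal sketch for listing them per LCLid before filling anything, assuming fixed example values in place of the random ones (the variable names times and missing are made up for this illustration):

```python
import pandas as pd

times = pd.to_datetime(
    ["2014-01-01 00:00", "2014-01-01 00:30", "2014-01-01 03:00", "2014-01-01 03:30"]
)
# Same shape as the example data, with fixed values so the output is repeatable.
df = pd.DataFrame(
    {
        "LCLid": ["MAC000006"] * 4 + ["MAC000007"] * 4,
        "energy(kWh/hh)": [0.27, 0.72, 0.82, 0.69, 0.17, 0.28, 0.03, 0.87],
    },
    index=times.append(times),
).rename_axis("day_time")

full_idx = pd.date_range(df.index.min(), df.index.max(), freq="30min")

# For each meter, collect the expected timestamps that have no row.
missing = {
    lclid: full_idx.difference(group.index)
    for lclid, group in df.groupby("LCLid")
}

for lclid, gaps in missing.items():
    print(lclid, list(gaps))
```

This can be handy as a check that the gaps are where you expect them before choosing a fill method.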
You said you want to fill the missing values using the closest available surrounding values. This can be done during reindexing, as follows:
grouped_df.reindex(full_idx, method='nearest')
Result:
LCLid energy(kWh/hh)
2014-01-01 00:00:00 MAC000006 0.270453
2014-01-01 00:30:00 MAC000006 0.716418
2014-01-01 01:00:00 MAC000006 0.716418
2014-01-01 01:30:00 MAC000006 0.716418
2014-01-01 02:00:00 MAC000006 0.819146
2014-01-01 02:30:00 MAC000006 0.819146
2014-01-01 03:00:00 MAC000006 0.819146
2014-01-01 03:30:00 MAC000006 0.688879
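For comparison, here is a small sketch of how method='nearest' differs from the other fill methods that reindex() accepts, and how the tolerance argument limits how far away a source row may be. The energy values are fixed stand-ins chosen for this illustration.

```python
import pandas as pd

meter = pd.DataFrame(
    {"LCLid": "MAC000006", "energy(kWh/hh)": [0.27, 0.72, 0.82, 0.69]},
    index=pd.to_datetime(
        ["2014-01-01 00:00", "2014-01-01 00:30", "2014-01-01 03:00", "2014-01-01 03:30"]
    ),
)
full_idx = pd.date_range(meter.index.min(), meter.index.max(), freq="30min")

# Closest row in either direction: first half of the gap copies 00:30,
# second half copies 03:00.
nearest = meter.reindex(full_idx, method="nearest")

# Last valid row carried forward: the whole gap copies 00:30.
forward = meter.reindex(full_idx, method="ffill")

# Next valid row carried backward: the whole gap copies 03:00.
backward = meter.reindex(full_idx, method="bfill")

# Only fill when a real reading exists within 30 minutes; 01:30 and 02:00
# are 60 minutes away from any reading, so they stay NaN.
capped = meter.reindex(full_idx, method="nearest", tolerance=pd.Timedelta("30min"))

print(capped)
```

A tolerance like this may be worth considering when gaps can be long, so that a reading is not silently copied across many hours.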
Now we want to apply this transformation to every group in the DataFrame, where a group is defined by its LCLid.
(
    df
    .groupby('LCLid', as_index=False)  # use LCLid as groupby key, but don't add it as a group index
    .apply(lambda group: group.reindex(full_idx, method='nearest'))  # do this for each group
    .reset_index(level=0, drop=True)  # get rid of the automatic index generated during groupby
    .sort_index()  # This is optional, just in case you want timestamps in chronological order
)
Result:
LCLid energy(kWh/hh)
2014-01-01 00:00:00 MAC000006 0.270453
2014-01-01 00:00:00 MAC000007 0.170603
2014-01-01 00:30:00 MAC000006 0.716418
2014-01-01 00:30:00 MAC000007 0.276678
2014-01-01 01:00:00 MAC000006 0.716418
2014-01-01 01:00:00 MAC000007 0.276678
2014-01-01 01:30:00 MAC000006 0.716418
2014-01-01 01:30:00 MAC000007 0.276678
2014-01-01 02:00:00 MAC000006 0.819146
2014-01-01 02:00:00 MAC000007 0.027490
2014-01-01 02:30:00 MAC000006 0.819146
2014-01-01 02:30:00 MAC000007 0.027490
2014-01-01 03:00:00 MAC000006 0.819146
2014-01-01 03:00:00 MAC000007 0.027490
2014-01-01 03:30:00 MAC000006 0.688879
2014-01-01 03:30:00 MAC000007 0.868017
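An equivalent way to write the same per-group reindex is an explicit loop over the groups followed by pd.concat. This is only an alternative sketch, not the method used above. It may be useful because the way groupby().apply() treats the grouping column has changed between pandas releases. The example values are fixed stand-ins, and rows_per_meter is a name made up for the sanity check.

```python
import pandas as pd

times = pd.to_datetime(
    ["2014-01-01 00:00", "2014-01-01 00:30", "2014-01-01 03:00", "2014-01-01 03:30"]
)
df = pd.DataFrame(
    {
        "LCLid": ["MAC000006"] * 4 + ["MAC000007"] * 4,
        "energy(kWh/hh)": [0.27, 0.72, 0.82, 0.69, 0.17, 0.28, 0.03, 0.87],
    },
    index=times.append(times),
).rename_axis("day_time")

full_idx = pd.date_range(df.index.min(), df.index.max(), freq="30min")

# Reindex each meter separately, then stack the pieces back together.
# A stable sort keeps the meters in a consistent order within each timestamp.
filled = pd.concat(
    [group.reindex(full_idx, method="nearest") for _, group in df.groupby("LCLid")]
).sort_index(kind="mergesort")

# Sanity check: every meter should now have one row per expected timestamp.
rows_per_meter = filled.groupby("LCLid").size()
print(rows_per_meter)
```

Because each group is reindexed on its own, a gap in one meter is always filled from that same meter's readings.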
Relevant documentation:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.date_range.html
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reindex.html
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.GroupBy.apply.html
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reset_index.html
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_index.html