[英]pandas - Extend Index of a DataFrame setting all columns for new rows to NaN?
我有時間索引數據:
df2 = pd.DataFrame({ 'day': pd.Series([date(2012, 1, 1), date(2012, 1, 3)]), 'b' : pd.Series([0.22, 0.3]) })
df2 = df2.set_index('day')
df2
b
day
2012-01-01 0.22
2012-01-03 0.30
擴展此數據框的最佳方法是什么,使其在 2012 年 1 月的每一天都有一行(比如說),其中所有列都設置為NaN
(這里只有b
),我們沒有數據?
所以想要的結果是:
b
day
2012-01-01 0.22
2012-01-02 NaN
2012-01-03 0.30
2012-01-04 NaN
...
2012-01-31 NaN
非常感謝!
使用這個(從熊貓 1.1.3 開始):
ix = pd.date_range(start=date(2012, 1, 1), end=date(2012, 1, 31), freq='D')
df2.reindex(ix)
這給出了:
b
2012-01-01 0.22
2012-01-02 NaN
2012-01-03 0.30
2012-01-04 NaN
2012-01-05 NaN
[...]
2012-01-29 NaN
2012-01-30 NaN
2012-01-31 NaN
對於舊版本的熊貓,將pd.date_range
替換為pd.DatetimeIndex
。
您可以重新采樣過去的日期作為頻率,而不指定fill_method
參數缺失值將根據需要填充NaN
df3 = df2.asfreq('D')
df3
Out[16]:
b
2012-01-01 0.22
2012-01-02 NaN
2012-01-03 0.30
回答你的第二部分,我目前想不出更優雅的方式:
df3 = DataFrame({ 'day': Series([date(2012, 1, 4), date(2012, 1, 31)])})
df3.set_index('day',inplace=True)
merged = df2.append(df3)
merged = merged.asfreq('D')
merged
Out[46]:
b
2012-01-01 0.22
2012-01-02 NaN
2012-01-03 0.30
2012-01-04 NaN
2012-01-05 NaN
2012-01-06 NaN
2012-01-07 NaN
2012-01-08 NaN
2012-01-09 NaN
2012-01-10 NaN
2012-01-11 NaN
2012-01-12 NaN
2012-01-13 NaN
2012-01-14 NaN
2012-01-15 NaN
2012-01-16 NaN
2012-01-17 NaN
2012-01-18 NaN
2012-01-19 NaN
2012-01-20 NaN
2012-01-21 NaN
2012-01-22 NaN
2012-01-23 NaN
2012-01-24 NaN
2012-01-25 NaN
2012-01-26 NaN
2012-01-27 NaN
2012-01-28 NaN
2012-01-29 NaN
2012-01-30 NaN
2012-01-31 NaN
這構建了第二個時間序列,然后我們像以前一樣追加並調用asfreq('D')
。
這是另一種選擇:首先在您想要的最后一天添加NaN
記錄,然后重新采樣。 這樣重采樣將為您填補缺失的日期。
起始幀:
import pandas as pd
import numpy as np
from datetime import date
df2 = pd.DataFrame({ 'day': pd.Series([date(2012, 1, 1), date(2012, 1, 3)]), 'b' : pd.Series([0.22, 0.3]) })
df2= df2.set_index('day')
df2
Out:
b
day
2012-01-01 0.22
2012-01-03 0.30
填充框架:
df2 = df2.set_value(date(2012,1,31),'b',np.float('nan'))
df2.asfreq('D')
Out:
b
day
2012-01-01 0.22
2012-01-02 NaN
2012-01-03 0.30
2012-01-04 NaN
2012-01-05 NaN
2012-01-06 NaN
2012-01-07 NaN
2012-01-08 NaN
2012-01-09 NaN
2012-01-10 NaN
2012-01-11 NaN
2012-01-12 NaN
2012-01-13 NaN
2012-01-14 NaN
2012-01-15 NaN
2012-01-16 NaN
2012-01-17 NaN
2012-01-18 NaN
2012-01-19 NaN
2012-01-20 NaN
2012-01-21 NaN
2012-01-22 NaN
2012-01-23 NaN
2012-01-24 NaN
2012-01-25 NaN
2012-01-26 NaN
2012-01-27 NaN
2012-01-28 NaN
2012-01-29 NaN
2012-01-30 NaN
2012-01-31 NaN
馬克的回答似乎不再適用於熊貓 1.1.1。
但是,使用相同的想法,以下工作:
from datetime import datetime
import pandas as pd
# get start and desired end dates
first_date = df['date'].min()
today = datetime.today()
# set index
df.set_index('date', inplace=True)
# and here is were the magic happens
idx = pd.date_range(first_date, today, freq='D')
df = df.reindex(idx)
編輯:剛剛發現這個確切的用例在文檔中:
不完全是問題,因為您知道第二個索引是一月的所有天數,但假設您有另一個索引來自另一個數據框 df1,它可能不相交且頻率隨機。 然后你可以這樣做:
ix = pd.DatetimeIndex(list(df2.index) + list(df1.index)).unique().sort_values()
df2.reindex(ix)
將索引轉換為列表允許以自然的方式創建更長的列表。
def extendframe(df, ndays):
"""
(df, ndays) -> df that is padded by ndays in beginning and end
"""
ixd = df.index - datetime.timedelta(ndays)
ixu = df.index + datetime.timedelta(ndays)
ixx = df.index.union(ixd.union(ixu))
df_ = df.reindex(ixx)
return df_
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.