pandas - Extend Index of a DataFrame setting all columns for new rows to NaN?
I have time-indexed data:
df2 = pd.DataFrame({ 'day': pd.Series([date(2012, 1, 1), date(2012, 1, 3)]), 'b' : pd.Series([0.22, 0.3]) })
df2 = df2.set_index('day')
df2
b
day
2012-01-01 0.22
2012-01-03 0.30
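What is the best way to extend this DataFrame so that it has one row for every day in January 2012 (say), with all columns (here only b) set to NaN wherever we have no data?
So the desired result is: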
b
day
2012-01-01 0.22
2012-01-02 NaN
2012-01-03 0.30
2012-01-04 NaN
...
2012-01-31 NaN
Many thanks!
Use this (as of pandas 1.1.3):
ix = pd.date_range(start=date(2012, 1, 1), end=date(2012, 1, 31), freq='D')
df2.reindex(ix)
This gives:
b
2012-01-01 0.22
2012-01-02 NaN
2012-01-03 0.30
2012-01-04 NaN
2012-01-05 NaN
[...]
2012-01-29 NaN
2012-01-30 NaN
2012-01-31 NaN
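For older versions of pandas, replace pd.date_range with pd.DatetimeIndex.
You can resample by passing day as the frequency. Without specifying a fill_method parameter, missing values are filled with NaN, as you want: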
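A minimal, self-contained sketch of the reindex approach above. It assumes the index is converted to a real DatetimeIndex with pd.to_datetime (the question uses datetime.date objects instead), and the variable name full is only illustrative.

```python
import pandas as pd

# Same data as the question, but with a DatetimeIndex (assumption:
# the original frame is indexed by datetime.date objects instead)
df2 = pd.DataFrame(
    {"b": [0.22, 0.3]},
    index=pd.to_datetime(["2012-01-01", "2012-01-03"]),
)
df2.index.name = "day"

# One label per calendar day in January 2012
ix = pd.date_range(start="2012-01-01", end="2012-01-31", freq="D")

# Labels missing from df2 get NaN in every column
full = df2.reindex(ix)
print(full.head())
```

Because reindex returns a new frame, df2 itself is left unchanged.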
df3 = df2.asfreq('D')
df3
Out[16]:
b
2012-01-01 0.22
2012-01-02 NaN
2012-01-03 0.30
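As the output above shows, asfreq('D') only fills gaps between the first and last dates already present in the index; it does not extend the frame to the end of the month. A short sketch of that behaviour, assuming a DatetimeIndex built with pd.to_datetime:

```python
import pandas as pd

df2 = pd.DataFrame(
    {"b": [0.22, 0.3]},
    index=pd.to_datetime(["2012-01-01", "2012-01-03"]),
)

# asfreq builds a daily range from the existing min to the existing max
df3 = df2.asfreq("D")

print(df3.index.min(), df3.index.max())
print(len(df3))
```

This is why the remaining answers first add a row at the desired end date before calling asfreq.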
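To answer the second part, I can't think of a more elegant way at the moment:
df3 = pd.DataFrame({'day': pd.Series([date(2012, 1, 4), date(2012, 1, 31)])})
df3.set_index('day', inplace=True)
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
merged = pd.concat([df2, df3])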
merged = merged.asfreq('D')
merged
Out[46]:
b
2012-01-01 0.22
2012-01-02 NaN
2012-01-03 0.30
2012-01-04 NaN
2012-01-05 NaN
2012-01-06 NaN
2012-01-07 NaN
2012-01-08 NaN
2012-01-09 NaN
2012-01-10 NaN
2012-01-11 NaN
2012-01-12 NaN
2012-01-13 NaN
2012-01-14 NaN
2012-01-15 NaN
2012-01-16 NaN
2012-01-17 NaN
2012-01-18 NaN
2012-01-19 NaN
2012-01-20 NaN
2012-01-21 NaN
2012-01-22 NaN
2012-01-23 NaN
2012-01-24 NaN
2012-01-25 NaN
2012-01-26 NaN
2012-01-27 NaN
2012-01-28 NaN
2012-01-29 NaN
2012-01-30 NaN
2012-01-31 NaN
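This constructs a second time series, and then we append and call asfreq('D') as before.
Here is another option: first add a NaN record on the last day you want, then resample. The resampling then fills in the missing dates for you.
Starting frame: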
import pandas as pd
import numpy as np
from datetime import date
df2 = pd.DataFrame({ 'day': pd.Series([date(2012, 1, 1), date(2012, 1, 3)]), 'b' : pd.Series([0.22, 0.3]) })
df2 = df2.set_index('day')
df2
Out:
b
day
2012-01-01 0.22
2012-01-03 0.30
Fill the frame:
# set_value was removed in pandas 1.0 and np.float in NumPy 1.24; .loc with np.nan does the same
df2.loc[date(2012, 1, 31), 'b'] = np.nan
df2.asfreq('D')
Out:
b
day
2012-01-01 0.22
2012-01-02 NaN
2012-01-03 0.30
2012-01-04 NaN
2012-01-05 NaN
2012-01-06 NaN
2012-01-07 NaN
2012-01-08 NaN
2012-01-09 NaN
2012-01-10 NaN
2012-01-11 NaN
2012-01-12 NaN
2012-01-13 NaN
2012-01-14 NaN
2012-01-15 NaN
2012-01-16 NaN
2012-01-17 NaN
2012-01-18 NaN
2012-01-19 NaN
2012-01-20 NaN
2012-01-21 NaN
2012-01-22 NaN
2012-01-23 NaN
2012-01-24 NaN
2012-01-25 NaN
2012-01-26 NaN
2012-01-27 NaN
2012-01-28 NaN
2012-01-29 NaN
2012-01-30 NaN
2012-01-31 NaN
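The same "sentinel row, then asfreq" idea can be sketched against current pandas as follows. This assumes a DatetimeIndex rather than the datetime.date index used above, and the name january is only illustrative.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame(
    {"b": [0.22, 0.3]},
    index=pd.DatetimeIndex(["2012-01-01", "2012-01-03"], name="day"),
)

# Add a NaN row on the last desired day (setting with enlargement via .loc)
df2.loc[pd.Timestamp("2012-01-31"), "b"] = np.nan

# asfreq now spans 2012-01-01 through 2012-01-31 and fills the gaps with NaN
january = df2.asfreq("D")
print(january.tail())
```

Note that the .loc assignment modifies df2 in place, so the sentinel row stays in the original frame.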
Mark's answer no longer seems to work as of pandas 1.1.1.
However, using the same idea, the following works:
from datetime import datetime
import pandas as pd
# get start and desired end dates
first_date = df['date'].min()
today = datetime.today()
# set index
df.set_index('date', inplace=True)
# and here is where the magic happens
idx = pd.date_range(first_date, today, freq='D')
df = df.reindex(idx)
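Edit: just found that this exact use case is in the docs:
This is not exactly the question, since you know the second index is all the days in January, but suppose you have another index, say from another DataFrame df1, which might be disjoint and have an irregular frequency. Then you can do this: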
ix = pd.DatetimeIndex(list(df2.index) + list(df1.index)).unique().sort_values()
df2.reindex(ix)
Converting the indexes to lists lets you build the longer list in a natural way.
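As an alternative sketch, Index.union can build the same combined index without going through lists. The frame df1 below is hypothetical data made up for illustration, and both frames are assumed to carry a DatetimeIndex.

```python
import pandas as pd

df2 = pd.DataFrame(
    {"b": [0.22, 0.3]},
    index=pd.to_datetime(["2012-01-01", "2012-01-03"]),
)
# Hypothetical second frame: irregular dates, one of them shared with df2
df1 = pd.DataFrame(
    {"a": [1.0, 2.0, 3.0]},
    index=pd.to_datetime(["2012-01-03", "2012-01-02", "2012-01-10"]),
)

# List-based construction from the answer above
ix_list = pd.DatetimeIndex(list(df2.index) + list(df1.index)).unique().sort_values()

# Index.union returns the unique labels of both indexes, sorted by default
ix_union = df2.index.union(df1.index)

extended = df2.reindex(ix_union)
print(extended)
```

Both constructions yield the same labels here; union just skips the explicit unique and sort steps.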
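import datetime

def extendframe(df, ndays):
    """
    (df, ndays) -> df that is padded by ndays in beginning and end
    """
    # shift every index label ndays back and ndays forward
    ixd = df.index - datetime.timedelta(ndays)
    ixu = df.index + datetime.timedelta(ndays)
    # reindex on the union of original and shifted labels; new rows are NaN
    ixx = df.index.union(ixd.union(ixu))
    df_ = df.reindex(ixx)
    return df_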
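A usage sketch for extendframe. The function adds labels shifted by exactly ndays from each existing label, so for ndays greater than 1 the days in between are not added. The pad_days helper below is a hypothetical variant (not part of the answer) that covers every day in the padded range, assuming a daily DatetimeIndex.

```python
import datetime
import pandas as pd

def extendframe(df, ndays):
    """Add rows ndays before and ndays after each existing index label."""
    ixd = df.index - datetime.timedelta(ndays)
    ixu = df.index + datetime.timedelta(ndays)
    return df.reindex(df.index.union(ixd.union(ixu)))

def pad_days(df, ndays):
    """Hypothetical variant: every day from ndays before the first
    label to ndays after the last label."""
    start = df.index.min() - pd.Timedelta(days=ndays)
    end = df.index.max() + pd.Timedelta(days=ndays)
    return df.reindex(pd.date_range(start, end, freq="D"))

df = pd.DataFrame(
    {"b": [0.22, 0.3]},
    index=pd.to_datetime(["2012-01-05", "2012-01-07"]),
)

shifted = extendframe(df, 3)  # only the labels shifted by +/- 3 days are added
padded = pad_days(df, 3)      # contiguous daily range around the data
print(len(shifted), len(padded))
```

Which of the two is preferable depends on whether the intermediate days are needed.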