pandas - Extend Index of a DataFrame setting all columns for new rows to NaN?

I have time-indexed data:

import pandas as pd
from datetime import date

df2 = pd.DataFrame({'day': pd.Series([date(2012, 1, 1), date(2012, 1, 3)]), 'b': pd.Series([0.22, 0.3])})
df2 = df2.set_index('day')
df2
               b
 day             
2012-01-01  0.22
2012-01-03  0.30

What is the best way to extend this data frame so that it has one row for every day in January 2012 (say), with all columns (here only b) set to NaN on the days for which we have no data?

So the desired result would be:

               b
 day             
2012-01-01  0.22
2012-01-02   NaN
2012-01-03  0.30
2012-01-04   NaN
...
2012-01-31   NaN

Many thanks!

Use this (current as of pandas 1.1.3):

ix = pd.date_range(start=date(2012, 1, 1), end=date(2012, 1, 31), freq='D')
df2.reindex(ix)

Which gives:

               b
2012-01-01  0.22
2012-01-02   NaN
2012-01-03  0.30
2012-01-04   NaN
2012-01-05   NaN
[...]
2012-01-29   NaN
2012-01-30   NaN
2012-01-31   NaN

For older versions of pandas, replace pd.date_range with pd.DatetimeIndex, which used to accept the same start, end and freq arguments.
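
A self-contained sketch of the same approach. It assumes the index is a proper DatetimeIndex, built here from strings with pd.to_datetime rather than from datetime.date objects; that choice is an assumption of this sketch, not part of the question's setup.

```python
import pandas as pd

# Two observations indexed by timestamp (a DatetimeIndex, not datetime.date objects).
df2 = pd.DataFrame(
    {"b": [0.22, 0.3]},
    index=pd.to_datetime(["2012-01-01", "2012-01-03"]),
)

# One entry per calendar day in January 2012.
ix = pd.date_range(start="2012-01-01", end="2012-01-31", freq="D")

# reindex keeps the rows whose labels match and inserts NaN rows for the rest.
full = df2.reindex(ix)

print(len(full))                # 31
print(full["b"].notna().sum())  # 2
```

Because reindex returns a new frame, df2 itself is left unchanged; assign the result back if you want to keep it.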

You can resample to daily frequency with asfreq('D'). If you don't specify a fill method, the missing values are filled with NaN, as you wanted:

df3 = df2.asfreq('D')
df3

Out[16]:
               b
2012-01-01  0.22
2012-01-02   NaN
2012-01-03  0.30
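
Note that asfreq only covers the span between the first and last labels already in the index, which is why the output above stops at 2012-01-03. A minimal sketch of that behaviour, assuming a DatetimeIndex built with pd.to_datetime:

```python
import pandas as pd

df2 = pd.DataFrame(
    {"b": [0.22, 0.3]},
    index=pd.to_datetime(["2012-01-01", "2012-01-03"]),
)

# asfreq fills the gaps between the existing first and last dates only.
df3 = df2.asfreq("D")

print(len(df3))               # 3
print(df3["b"].isna().sum())  # 1
```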

To answer your second part, I can't think of a more elegant way at the moment:

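# Build an empty frame whose index holds the extra dates
df3 = pd.DataFrame({'day': pd.Series([date(2012, 1, 4), date(2012, 1, 31)])})
df3.set_index('day', inplace=True)
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
merged = pd.concat([df2, df3])
merged = merged.asfreq('D')
merged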


Out[46]:
               b
2012-01-01  0.22
2012-01-02   NaN
2012-01-03  0.30
2012-01-04   NaN
2012-01-05   NaN
2012-01-06   NaN
2012-01-07   NaN
2012-01-08   NaN
2012-01-09   NaN
2012-01-10   NaN
2012-01-11   NaN
2012-01-12   NaN
2012-01-13   NaN
2012-01-14   NaN
2012-01-15   NaN
2012-01-16   NaN
2012-01-17   NaN
2012-01-18   NaN
2012-01-19   NaN
2012-01-20   NaN
2012-01-21   NaN
2012-01-22   NaN
2012-01-23   NaN
2012-01-24   NaN
2012-01-25   NaN
2012-01-26   NaN
2012-01-27   NaN
2012-01-28   NaN
2012-01-29   NaN
2012-01-30   NaN
2012-01-31   NaN

This constructs a second time series, which we then append before calling asfreq('D') as before.

Here's another option: first add a NaN record on the last day you want, then resample. The resampling will then fill in the missing dates for you.

Starting Frame:

import pandas as pd
import numpy as np
from datetime import date

df2 = pd.DataFrame({ 'day': pd.Series([date(2012, 1, 1), date(2012, 1, 3)]), 'b' : pd.Series([0.22, 0.3]) })
df2= df2.set_index('day')
df2

Out:
                  b
    day 
    2012-01-01  0.22
    2012-01-03  0.30

Filled Frame:

# set_value was removed in pandas 1.0 and np.float in NumPy 1.24; use .loc and np.nan instead
df2.loc[date(2012, 1, 31), 'b'] = np.nan
df2.asfreq('D')

Out:
                b
    day 
    2012-01-01  0.22
    2012-01-02  NaN
    2012-01-03  0.30
    2012-01-04  NaN
    2012-01-05  NaN
    2012-01-06  NaN
    2012-01-07  NaN
    2012-01-08  NaN
    2012-01-09  NaN
    2012-01-10  NaN
    2012-01-11  NaN
    2012-01-12  NaN
    2012-01-13  NaN
    2012-01-14  NaN
    2012-01-15  NaN
    2012-01-16  NaN
    2012-01-17  NaN
    2012-01-18  NaN
    2012-01-19  NaN
    2012-01-20  NaN
    2012-01-21  NaN
    2012-01-22  NaN
    2012-01-23  NaN
    2012-01-24  NaN
    2012-01-25  NaN
    2012-01-26  NaN
    2012-01-27  NaN
    2012-01-28  NaN
    2012-01-29  NaN
    2012-01-30  NaN
    2012-01-31  NaN
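
A compact sketch of this "add a sentinel row, then resample" idea, assuming a DatetimeIndex (built with pd.to_datetime) and label-based assignment with .loc:

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame(
    {"b": [0.22, 0.3]},
    index=pd.to_datetime(["2012-01-01", "2012-01-03"]),
)

# Assigning to a label that is not yet in the index appends a new row.
df2.loc[pd.Timestamp("2012-01-31"), "b"] = np.nan

# asfreq now spans 2012-01-01 through the sentinel row on 2012-01-31.
filled = df2.asfreq("D")

print(len(filled))                # 31
print(filled["b"].notna().sum())  # 2
```

The sentinel only extends the end of the range; to extend the start as well, a second NaN row on the first desired day would be needed.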

Mark's answer no longer seems to work on pandas 1.1.1.

However, using the same idea, the following works:

from datetime import datetime
import pandas as pd


# get start and desired end dates
first_date = df['date'].min()
today = datetime.today()

# set index
df.set_index('date', inplace=True)

# and here is where the magic happens
idx = pd.date_range(first_date, today, freq='D')
df = df.reindex(idx)

EDIT: I just found out that this exact use case is in the docs:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reindex.html#pandas.DataFrame.reindex
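
The snippet above assumes an existing df with a 'date' column. A self-contained sketch of the same steps follows; the sample data, the 'value' column and the fixed end date (used instead of today so the result is reproducible) are all made up for illustration.

```python
import pandas as pd

# Hypothetical input: a 'date' column with gaps between observations.
df = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-01", "2020-01-04"]),
    "value": [1.0, 2.0],
})

first_date = df["date"].min()
last_date = pd.Timestamp("2020-01-07")  # fixed end date instead of datetime.today()

# Move the dates into the index, then reindex on the full daily range.
df = df.set_index("date")
idx = pd.date_range(first_date, last_date, freq="D")
df = df.reindex(idx)

print(len(df))                     # 7
print(df["value"].notna().sum())   # 2
```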

This is not exactly the question, since there you know that the second index is all days in January. But suppose you have another index, say from another data frame df1, which might be disjoint from df2's index and irregularly spaced. Then you can do this:

ix = pd.DatetimeIndex(list(df2.index) + list(df1.index)).unique().sort_values()
df2.reindex(ix)

Converting the indices to lists lets you build the longer list by plain concatenation.
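
As a possible alternative to going through lists, Index.union can combine the two indexes directly. A minimal sketch, where df1 and its column 'a' are made-up sample data and both indexes are assumed to be unique DatetimeIndex objects:

```python
import pandas as pd

# Hypothetical frames with disjoint, irregularly spaced dates.
df1 = pd.DataFrame(
    {"a": [1.0, 2.0]},
    index=pd.to_datetime(["2012-01-02", "2012-01-10"]),
)
df2 = pd.DataFrame(
    {"b": [0.22, 0.3]},
    index=pd.to_datetime(["2012-01-01", "2012-01-03"]),
)

# Index.union combines both indexes; the result is sorted when possible.
ix = df2.index.union(df1.index)
extended = df2.reindex(ix)

print(len(extended))               # 4
print(extended["b"].isna().sum())  # 2
```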

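import datetime

def extendframe(df, ndays):
    """
    (df, ndays) -> df reindexed on the union of its index and that index
    shifted ndays days backward and forward; the new rows are NaN
    """
    ixd = df.index - datetime.timedelta(ndays)  # every label shifted back
    ixu = df.index + datetime.timedelta(ndays)  # every label shifted forward
    ixx = df.index.union(ixd.union(ixu))
    df_ = df.reindex(ixx)
    return df_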
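
A short usage sketch for this helper, with the function repeated in condensed form so the block runs on its own. The sample frame is made up and assumes a DatetimeIndex. Note that every existing label is shifted, so the result gains rows around each observation, not only at the two ends.

```python
import datetime
import pandas as pd

def extendframe(df, ndays):
    """Reindex df on its index plus that index shifted by -ndays and +ndays days."""
    ixd = df.index - datetime.timedelta(ndays)
    ixu = df.index + datetime.timedelta(ndays)
    return df.reindex(df.index.union(ixd.union(ixu)))

df = pd.DataFrame(
    {"b": [0.22, 0.3]},
    index=pd.to_datetime(["2012-01-01", "2012-01-03"]),
)

out = extendframe(df, 1)

# Labels: 2011-12-31, 2012-01-01, 2012-01-02, 2012-01-03, 2012-01-04
print(len(out))                # 5
print(out["b"].notna().sum())  # 2
```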
