
pandas - Extend Index of a DataFrame setting all columns for new rows to NaN?

I have time-indexed data:

import pandas as pd
from datetime import date

df2 = pd.DataFrame({ 'day': pd.Series([date(2012, 1, 1), date(2012, 1, 3)]), 'b' : pd.Series([0.22, 0.3]) })
df2 = df2.set_index('day')
df2
               b
 day             
2012-01-01  0.22
2012-01-03  0.30

What is the best way to extend this data frame so that it has one row for every day in, say, January 2012, with all columns (here only b) set to NaN on the days where we don't have data?

So the desired result would be:

               b
 day             
2012-01-01  0.22
2012-01-02   NaN
2012-01-03  0.30
2012-01-04   NaN
...
2012-01-31   NaN

Many thanks!

Use this (current as of pandas 1.1.3):

ix = pd.date_range(start=date(2012, 1, 1), end=date(2012, 1, 31), freq='D')
df2.reindex(ix)

Which gives:

               b
2012-01-01  0.22
2012-01-02   NaN
2012-01-03  0.30
2012-01-04   NaN
2012-01-05   NaN
[...]
2012-01-29   NaN
2012-01-30   NaN
2012-01-31   NaN

For older versions of pandas, replace pd.date_range with pd.DatetimeIndex.
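
A self-contained sketch of the reindex approach follows. It assumes the index is first converted to a DatetimeIndex with pd.to_datetime (that conversion is an addition here, so that labels are matched as timestamps rather than as datetime.date objects). The names full_month and extended are illustrative only.

```python
from datetime import date

import pandas as pd

df2 = pd.DataFrame({"day": [date(2012, 1, 1), date(2012, 1, 3)],
                    "b": [0.22, 0.3]}).set_index("day")

# Assumption: convert the date objects to a DatetimeIndex before reindexing.
df2.index = pd.to_datetime(df2.index)

# One label per calendar day in January 2012.
full_month = pd.date_range(start="2012-01-01", end="2012-01-31", freq="D")

# Labels that are missing from df2 become new rows with NaN in every column.
extended = df2.reindex(full_month)
print(extended.head())
```

If a default other than NaN is wanted, reindex also accepts a fill_value argument.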

You can resample to daily frequency with asfreq('D'). If you don't specify a fill method, the missing values are filled with NaN, as you wanted:

df3 = df2.asfreq('D')
df3

Out[16]:
               b
2012-01-01  0.22
2012-01-02   NaN
2012-01-03  0.30
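
Note that asfreq only spans the range from the first to the last existing label, which is why the output above stops at 2012-01-03. A minimal sketch of that behaviour, assuming the index has been converted to a DatetimeIndex with pd.to_datetime (an added step, not part of the snippet above):

```python
from datetime import date

import pandas as pd

df2 = pd.DataFrame({"day": [date(2012, 1, 1), date(2012, 1, 3)],
                    "b": [0.22, 0.3]}).set_index("day")
df2.index = pd.to_datetime(df2.index)  # assumption: work with a DatetimeIndex

# asfreq inserts the missing days between the first and last label only.
daily = df2.asfreq("D")
print(daily)
```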

To answer your second part, I can't think of a more elegant way at the moment:

df3 = pd.DataFrame({ 'day': pd.Series([date(2012, 1, 4), date(2012, 1, 31)])})
df3.set_index('day',inplace=True)
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
merged = pd.concat([df2, df3])
merged = merged.asfreq('D')
merged


Out[46]:
               b
2012-01-01  0.22
2012-01-02   NaN
2012-01-03  0.30
2012-01-04   NaN
2012-01-05   NaN
2012-01-06   NaN
2012-01-07   NaN
2012-01-08   NaN
2012-01-09   NaN
2012-01-10   NaN
2012-01-11   NaN
2012-01-12   NaN
2012-01-13   NaN
2012-01-14   NaN
2012-01-15   NaN
2012-01-16   NaN
2012-01-17   NaN
2012-01-18   NaN
2012-01-19   NaN
2012-01-20   NaN
2012-01-21   NaN
2012-01-22   NaN
2012-01-23   NaN
2012-01-24   NaN
2012-01-25   NaN
2012-01-26   NaN
2012-01-27   NaN
2012-01-28   NaN
2012-01-29   NaN
2012-01-30   NaN
2012-01-31   NaN

This constructs a second time series and then we just append and call asfreq('D') as before.

Here's another option: First add a NaN record on the last day you want, then resample. This way resampling will fill the missing dates for you.

Starting Frame:

import pandas as pd
import numpy as np
from datetime import date

df2 = pd.DataFrame({ 'day': pd.Series([date(2012, 1, 1), date(2012, 1, 3)]), 'b' : pd.Series([0.22, 0.3]) })
df2= df2.set_index('day')
df2

Out:
                  b
    day 
    2012-01-01  0.22
    2012-01-03  0.30

Filled Frame:

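# set_value was removed in pandas 1.0 and np.float in NumPy 1.24;
# assigning through .loc adds the new row in place
df2.loc[date(2012, 1, 31), 'b'] = np.nan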
df2.asfreq('D')

Out:
                b
    day 
    2012-01-01  0.22
    2012-01-02  NaN
    2012-01-03  0.30
    2012-01-04  NaN
    2012-01-05  NaN
    2012-01-06  NaN
    2012-01-07  NaN
    2012-01-08  NaN
    2012-01-09  NaN
    2012-01-10  NaN
    2012-01-11  NaN
    2012-01-12  NaN
    2012-01-13  NaN
    2012-01-14  NaN
    2012-01-15  NaN
    2012-01-16  NaN
    2012-01-17  NaN
    2012-01-18  NaN
    2012-01-19  NaN
    2012-01-20  NaN
    2012-01-21  NaN
    2012-01-22  NaN
    2012-01-23  NaN
    2012-01-24  NaN
    2012-01-25  NaN
    2012-01-26  NaN
    2012-01-27  NaN
    2012-01-28  NaN
    2012-01-29  NaN
    2012-01-30  NaN
    2012-01-31  NaN

Mark's answer no longer seems to work on pandas 1.1.1.

However, using the same idea, the following works:

from datetime import datetime
import pandas as pd


# get start and desired end dates
first_date = df['date'].min()
today = datetime.today()

# set index
df.set_index('date', inplace=True)

# and here is where the magic happens
idx = pd.date_range(first_date, today, freq='D')
df = df.reindex(idx)

EDIT: just found out that this exact use case is in the docs:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reindex.html#pandas.DataFrame.reindex

This is not exactly the question, since there you know that the target index is every day in January. But suppose you have another index, say from another data frame df1, which might be disjoint from yours and have an irregular frequency. Then you can do this:

ix = pd.DatetimeIndex(list(df2.index) + list(df1.index)).unique().sort_values()
df2.reindex(ix)

Converting indices to lists allows one to create a longer list in a natural way.
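
As an alternative sketch, Index.union combines two indexes without going through lists; with its default settings it drops duplicate labels and sorts the result. The frame df1 and its dates below are made up for illustration.

```python
import pandas as pd

df2 = pd.DataFrame({"b": [0.22, 0.3]},
                   index=pd.to_datetime(["2012-01-01", "2012-01-03"]))
# Hypothetical second frame with a disjoint, irregular index.
df1 = pd.DataFrame({"a": [1.0, 2.0]},
                   index=pd.to_datetime(["2012-01-10", "2012-01-02"]))

# Union of both indexes: duplicates removed, labels sorted.
ix = df2.index.union(df1.index)
result = df2.reindex(ix)
print(result)
```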

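import datetime

def extendframe(df, ndays):
    """
    (df, ndays) -> df with extra rows ndays before and ndays after
    each existing index label; the new rows are filled with NaN
    """
    # shift every index label down and up by ndays
    ixd = df.index - datetime.timedelta(ndays)
    ixu = df.index + datetime.timedelta(ndays)
    # combine the original and shifted labels, then reindex onto them
    ixx = df.index.union(ixd.union(ixu))
    df_ = df.reindex(ixx)
    return df_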
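
A short usage sketch of this idea, assuming a frame with a DatetimeIndex. The function is restated compactly so the block runs on its own, and the sample data is illustrative. Note that it adds only the labels shifted by exactly ndays, not every day in between.

```python
import datetime

import pandas as pd


def extendframe(df, ndays):
    """Add rows ndays before and ndays after each existing index label."""
    delta = datetime.timedelta(days=ndays)
    new_index = df.index.union((df.index - delta).union(df.index + delta))
    return df.reindex(new_index)


df2 = pd.DataFrame({"b": [0.22, 0.3]},
                   index=pd.to_datetime(["2012-01-01", "2012-01-03"]))

# With ndays=1 the labels 2011-12-31, 2012-01-02 and 2012-01-04 are added.
padded = extendframe(df2, 1)
print(padded)
```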
