简体   繁体   中英

Including the missing dates in the date column of pandas dataframe for a specific timespan

I have a time series dataframe whose headers are "Date" & "Value"

Sample Dataframe:

    Date              Value
    2019-10-01      46486868.0
    2019-11-01      36092742.0
    2019-12-01      32839185.0

Unfortunately, the Date column has some missing months. For example, the dataset is supposed to be a monthly dataset for the years 2001-2010 but it is missing a few random months in each year.

Since I am making a time series model, I cant have missing dates so I have been trying to figure out a way to include the same.

I tried creating a pandas object as follows:

df_date = pd.date_range(start='1/1/2001', end='11/1/2010', freq='MS')    
type(df_date)

pandas.core.indexes.datetimes.DatetimeIndex

But this is not getting merged with pandas dataframe. It gives the following error:

TypeError: Can only merge Series or DataFrame objects, a <class 'pandas.core.indexes.datetimes.DatetimeIndex'> was passed

The task at hand is to have all dates for each month in the time span by either matching it with another dataframe with all dates and ffil the values that are missing or somehow including the missing values in the existing dataframe.

Is there any way to solve this?

pd.date_range returns an index object. You can create a dateframe with it like this:

df_date = pd.DataFrame({'Date':pd.date_range(start='1/1/2019', end='1/1/2020', freq='MS')    })

and then you can merge

df_date.merge(df, on='Date', how='left')

Output:

         Date       Value
0  2019-01-01         NaN
1  2019-02-01         NaN
2  2019-03-01         NaN
3  2019-04-01         NaN
4  2019-05-01         NaN
5  2019-06-01         NaN
6  2019-07-01         NaN
7  2019-08-01         NaN
8  2019-09-01         NaN
9  2019-10-01  46486868.0
10 2019-11-01  36092742.0
11 2019-12-01  32839185.0
12 2020-01-01         NaN

Set Date as index and reindex it with df_date :

df_date = pd.date_range(start='1/1/2019', end='11/1/2020', freq='MS')
df = df.set_index('Date').reindex(df_date)

Output:

>>> df
                 Value
2019-01-01         NaN
2019-02-01         NaN
2019-03-01         NaN
2019-04-01         NaN
2019-05-01         NaN
2019-06-01         NaN
2019-07-01         NaN
2019-08-01         NaN
2019-09-01         NaN
2019-10-01  46486868.0
2019-11-01  36092742.0
2019-12-01  32839185.0
2020-01-01         NaN
2020-02-01         NaN
2020-03-01         NaN
2020-04-01         NaN
2020-05-01         NaN
2020-06-01         NaN
2020-07-01         NaN
2020-08-01         NaN
2020-09-01         NaN
2020-10-01         NaN
2020-11-01         NaN

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM