简体   繁体   中英

Pandas: Filling data for missing dates

Let's say I've got the following table:

ProdID  Date        Val1 Val2 Val3
Prod1   4/1/2019    1    3    4
Prod1   4/3/2019    2    3    54
Prod1   4/4/2019    3    4    54
Prod2   4/1/2019    1    3    3
Prod2   4/2/2019    1    3    4
Prod2   4/3/2019    2    4    4
Prod2   4/4/2019    2    5    3

Prod2 entries are populated correctly as we've got the data from 4/1/2019 to 4/4/2019 .

Prod1 has 1 missing date - 4/2/2019 .

I would like to find missing dates for all ProdIDs and fill in Val1-3 with data copied from the last of previous entry. For instance, I would like to copy data from 4/1/2019 to 4/2/2019

ProdID  Date        Val1 Val2 Val3
Prod1   4/1/2019    1    3    4
Prod1   4/2/2019    1    3    4
Prod1   4/3/2019    2    3    54
Prod1   4/4/2019    3    4    54
Prod2   4/1/2019    1    3    3
Prod2   4/2/2019    1    3    4
Prod2   4/3/2019    2    4    4
Prod2   4/4/2019    2    5    3

First convert column to datetime s by to_datetime , then create DatetimeIndex by DataFrame.set_index and call GroupBy.apply with DataFrame.asfreq - there is also possible specify method for forward or back filling missing values:

df['Date'] = pd.to_datetime(df['Date'])

df1 = (df.set_index('Date')
         .groupby('ProdID')
         .apply(lambda x: x.asfreq('D', method='ffill'))
         .reset_index(level=0, drop=True)
         .reset_index()
         .reindex(df.columns, axis=1))

print (df1)
  ProdID       Date  Val1  Val2  Val3
0  Prod1 2019-04-01     1     3     4
1  Prod1 2019-04-02     1     3     4
2  Prod1 2019-04-03     2     3    54
3  Prod1 2019-04-04     3     4    54
4  Prod2 2019-04-01     1     3     3
5  Prod2 2019-04-02     1     3     4
6  Prod2 2019-04-03     2     4     4
7  Prod2 2019-04-04     2     5     3

Another solution is create all combinations of product and datetimes by product and DataFrame.merge with left join, last forward filling missing values by ffill :

dates = pd.date_range(start=df['Date'].min(), end=df['Date'].max())
prods = df.ProdID.unique()

from  itertools import product
df1 = pd.DataFrame(list(product(prods, dates)), columns=['ProdID', 'Date'])
print (df1)
  ProdID       Date
0  Prod1 2019-04-01
1  Prod1 2019-04-02
2  Prod1 2019-04-03
3  Prod1 2019-04-04
4  Prod2 2019-04-01
5  Prod2 2019-04-02
6  Prod2 2019-04-03
7  Prod2 2019-04-04

df = df1.merge(df, how='left').ffill()
print (df)
  ProdID       Date  Val1  Val2  Val3
0  Prod1 2019-04-01   1.0   3.0   4.0
1  Prod1 2019-04-02   1.0   3.0   4.0
2  Prod1 2019-04-03   2.0   3.0  54.0
3  Prod1 2019-04-04   3.0   4.0  54.0
4  Prod2 2019-04-01   1.0   3.0   3.0
5  Prod2 2019-04-02   1.0   3.0   4.0
6  Prod2 2019-04-03   2.0   4.0   4.0
7  Prod2 2019-04-04   2.0   5.0   3.0

You can also use pandas.MultiIndex.from_product , DataFrame.reindex and DataFrame.ffill like this:

df['Date'] = pd.to_datetime(df['Date'])
dates = pd.date_range(start=df.Date.min(), end=df.Date.max(), freq='1D')

idx = pd.MultiIndex.from_product([df.ProdID.unique(), dates], names=['ProdID', 'Date'])

df.set_index(['ProdID', 'Date']).reindex(idx).ffill().reset_index()

[output]

  ProdID       Date  Val1  Val2  Val3
0  Prod1 2019-04-01   1.0   3.0   4.0
1  Prod1 2019-04-02   1.0   3.0   4.0
2  Prod1 2019-04-03   2.0   3.0  54.0
3  Prod1 2019-04-04   3.0   4.0  54.0
4  Prod2 2019-04-01   1.0   3.0   3.0
5  Prod2 2019-04-02   1.0   3.0   4.0
6  Prod2 2019-04-03   2.0   4.0   4.0
7  Prod2 2019-04-04   2.0   5.0   3.0

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM