Fill in missing dates for a pandas dataframe with multiple series

I have a dataframe that contains multiple time series, like so:

Date        Item     Category
2021-01-01  gouda    cheese
2021-01-02  gouda    cheese
2021-01-04  gouda    cheese
2021-01-05  gouda    cheese
2021-02-01  lettuce  produce
2021-02-02  lettuce  produce
2021-02-03  lettuce  produce
2021-02-05  lettuce  produce

I'd like to add rows for the missing dates (e.g. 2021-01-03 for gouda, 2021-02-04 for lettuce). Note that these series do not necessarily start and end on the same date.

What is the best way to do this in pandas? I'd also like to fill the new rows with the values in the "Item" and "Category" columns for that series.

Group by Item and Category, then generate a time series from the min to the max date:

result = (
    df.groupby(["Item", "Category"])["Date"]
    .apply(lambda s: pd.date_range(s.min(), s.max()))
    .explode()
    .reset_index()
)
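For reference, here is a self-contained sketch of the same approach run on the question's data. The final `pd.to_datetime` step is an addition and not part of the original answer: `explode` leaves the Date column with object dtype, so the conversion restores a proper datetime column.

```python
import pandas as pd

# The question's data: two series, each with one missing day
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2021-01-01", "2021-01-02", "2021-01-04", "2021-01-05",
        "2021-02-01", "2021-02-02", "2021-02-03", "2021-02-05",
    ]),
    "Item": ["gouda"] * 4 + ["lettuce"] * 4,
    "Category": ["cheese"] * 4 + ["produce"] * 4,
})

result = (
    df.groupby(["Item", "Category"])["Date"]
    # One full daily range per (Item, Category) pair
    .apply(lambda s: pd.date_range(s.min(), s.max()))
    # One row per date; Item and Category stay in the index
    .explode()
    .reset_index()
)

# Added step (assumption: a datetime column is wanted downstream)
result["Date"] = pd.to_datetime(result["Date"])
```

Because the group keys live in the index until `reset_index`, every generated date automatically carries its Item and Category.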

You can use resample:

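df['Date'] = pd.to_datetime(df['Date'])
df['Y-m'] = df['Date'].dt.strftime('%y-%m')  # helper key, e.g. '21-01'
# Within each month, upsample to daily rows and forward-fill Item and Category
out = df.groupby('Y-m').apply(lambda x: x.set_index('Date').resample('D').ffill()).reset_index(level=1)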
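The snippet above groups by year-month, which fills the gaps in the question's data because each series falls within a single calendar month. Below is a minimal sketch of the same resample idea grouped by Item instead, assuming each item belongs to exactly one category. Iterating over the groups (rather than using `groupby(...).apply`) is a choice made here so that every piece keeps its Item and Category columns for `ffill` to propagate.

```python
import pandas as pd

# The question's data: two series, each with one missing day
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2021-01-01", "2021-01-02", "2021-01-04", "2021-01-05",
        "2021-02-01", "2021-02-02", "2021-02-03", "2021-02-05",
    ]),
    "Item": ["gouda"] * 4 + ["lettuce"] * 4,
    "Category": ["cheese"] * 4 + ["produce"] * 4,
})

# Upsample each item's rows to daily frequency; ffill copies Item and
# Category from the previous day into the newly created rows.
parts = [
    group.set_index("Date").resample("D").ffill()
    for _, group in df.groupby("Item")
]

out = pd.concat(parts).reset_index()
```

Each series is resampled only between its own first and last date, so series that start and end on different days do not pad each other.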

This is far from optimal, but it is how I would do it to ensure that every item and category is covered from its min to its max date, with all dates in between filled:

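aux = []
for item in df['Item'].unique():
    subset = df[df['Item'] == item]
    # Daily range from this item's first date to its last date
    filled = pd.DataFrame({'Date': pd.date_range(subset['Date'].min(), subset['Date'].max(), freq='D')})
    filled['Item'] = item
    # Each item is assumed to have a single category
    filled['Category'] = subset['Category'].iloc[0]
    aux.append(filled)
output = pd.concat(aux)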

Consider this sample dataset:

df = pd.DataFrame({
    'Date': ['2021-01-01', '2021-01-02', '2021-01-04', '2021-01-05',
             '2021-01-01', '2021-01-02', '2021-01-04', '2021-01-05'],
    'Item': ['gouda', 'gouda', 'gouda', 'gouda', 'lettuce', 'lettuce', 'lettuce', 'lettuce'],
    'Category': ['cheese', 'cheese', 'cheese', 'cheese', 'produce', 'produce', 'produce', 'produce'],
})
df['Date'] = pd.to_datetime(df['Date'])

Output:

        Date     Item Category
0 2021-01-01    gouda   cheese
1 2021-01-02    gouda   cheese
2 2021-01-03    gouda   cheese
3 2021-01-04    gouda   cheese
4 2021-01-05    gouda   cheese
0 2021-01-01  lettuce  produce
1 2021-01-02  lettuce  produce
2 2021-01-03  lettuce  produce
3 2021-01-04  lettuce  produce
4 2021-01-05  lettuce  produce

One option is the complete function from pyjanitor, which explicitly generates the missing rows:

# pip install pyjanitor
import pandas as pd
import janitor

df.complete(
    {'Date': lambda date: pd.date_range(date.min(), date.max())},
    by=['Item', 'Category'],
    sort=True,
)

        Date     Item Category
0 2021-01-01    gouda   cheese
1 2021-01-02    gouda   cheese
2 2021-01-03    gouda   cheese
3 2021-01-04    gouda   cheese
4 2021-01-05    gouda   cheese
5 2021-02-01  lettuce  produce
6 2021-02-02  lettuce  produce
7 2021-02-03  lettuce  produce
8 2021-02-04  lettuce  produce
9 2021-02-05  lettuce  produce

The dictionary specifies the values to introduce into the dataframe. Each key of the dictionary should be an existing column; the lambda function refers to the Date column.
