Filling in missing dates in a pandas dataframe with multiple series

Question

I have a dataframe that contains multiple time series, like so:

Date	Item	Category
2021-01-01	gouda	cheese
2021-01-02	gouda	cheese
2021-01-04	gouda	cheese
2021-01-05	gouda	cheese
2021-02-01	lettuce	produce
2021-02-02	lettuce	produce
2021-02-03	lettuce	produce
2021-02-05	lettuce	produce

I'd like to add rows for the missing dates (e.g. 2021-01-03 for gouda, 2021-02-04 for lettuce). Note that these series do not necessarily start and end on the same date.

What is the best way to do this in pandas? I'd also like to fill the new rows with the values in the "Item" and "Category" columns for that series.

Answer 1

Group by Item and Category, then generate a time series from the min to the max date:

result = (
    df.groupby(["Item", "Category"])["Date"]
    .apply(lambda s: pd.date_range(s.min(), s.max()))
    .explode()
    .reset_index()
)
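For reference, here is a self-contained sketch of this approach run on the question's sample data. The construction of df and the final two lines (converting Date back to a datetime dtype and reordering the columns) are additions for illustration, not part of the original answer; the conversion is there because explode may leave the column with object dtype.

```python
import pandas as pd

# Sample data from the question, rebuilt here so the snippet runs on its own
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2021-01-01", "2021-01-02", "2021-01-04", "2021-01-05",
        "2021-02-01", "2021-02-02", "2021-02-03", "2021-02-05",
    ]),
    "Item": ["gouda"] * 4 + ["lettuce"] * 4,
    "Category": ["cheese"] * 4 + ["produce"] * 4,
})

# One full daily range per (Item, Category), then one row per day
result = (
    df.groupby(["Item", "Category"])["Date"]
    .apply(lambda s: pd.date_range(s.min(), s.max()))
    .explode()
    .reset_index()
)

# Illustrative clean-up: restore the datetime dtype and the original column order
result["Date"] = pd.to_datetime(result["Date"])
result = result[["Date", "Item", "Category"]]
```

Because Item and Category are the grouping keys, they come back as columns after reset_index, so the new rows carry the right values without a separate fill step.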

Answer 2

You can use resample, grouping by year and month:

df['Date'] = pd.to_datetime(df['Date'])
df['Y-m'] = df['Date'].dt.strftime('%y-%m')
out = df.groupby('Y-m').apply(lambda x : x.set_index('Date').resample('D').ffill()).reset_index(level=1)
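Grouping by year and month works for the sample data because each item sits in its own month. If two series could share a month, one option is to apply the same resample idea per (Item, Category) group instead. The following is a minimal sketch under that assumption; the explicit loop and the names parts and out_per_series are illustrative choices, not the answer author's code.

```python
import pandas as pd

# Sample data from the question, rebuilt here so the snippet runs on its own
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2021-01-01", "2021-01-02", "2021-01-04", "2021-01-05",
        "2021-02-01", "2021-02-02", "2021-02-03", "2021-02-05",
    ]),
    "Item": ["gouda"] * 4 + ["lettuce"] * 4,
    "Category": ["cheese"] * 4 + ["produce"] * 4,
})

parts = []
for _, group in df.groupby(["Item", "Category"]):
    # Upsample this series to daily frequency; ffill copies Item and Category
    # from the previous existing row into each newly created row
    daily = group.set_index("Date").resample("D").ffill().reset_index()
    parts.append(daily)

out_per_series = pd.concat(parts, ignore_index=True)
```

Each group is resampled only between its own first and last date, so series that start and end on different days are not padded to a common range.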

Answer 3

This is far from optimal, but it is how I would do it to ensure that every item and category is covered from its minimum to its maximum date, and that every range is filled:

aux = []
for x in df['Item'].unique():
  # Rows belonging to the current item
  sub = df[df['Item'] == x]
  # One row per calendar day between the item's first and last date
  filled = pd.DataFrame({'Date': pd.date_range(sub['Date'].min(), sub['Date'].max(), freq='D')})
  filled['Item'] = x
  filled['Category'] = sub['Category'].values[0]
  aux.append(filled)
output = pd.concat(aux)

Consider this sample dataset:

df = pd.DataFrame({'Date':['2021-01-01','2021-01-02','2021-01-04','2021-01-05','2021-01-01','2021-01-02','2021-01-04','2021-01-05'],
                   'Item':['gouda','gouda','gouda','gouda','lettuce','lettuce','lettuce','lettuce'],
                   'Category':['cheese','cheese','cheese','cheese','produce','produce','produce','produce']})
df['Date'] = pd.to_datetime(df['Date'])

Outputs:

        Date     Item Category
0 2021-01-01    gouda   cheese
1 2021-01-02    gouda   cheese
2 2021-01-03    gouda   cheese
3 2021-01-04    gouda   cheese
4 2021-01-05    gouda   cheese
0 2021-01-01  lettuce  produce
1 2021-01-02  lettuce  produce
2 2021-01-03  lettuce  produce
3 2021-01-04  lettuce  produce
4 2021-01-05  lettuce  produce
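A quick way to confirm that a result like this has no remaining gaps is to compare each series against the full daily range between its first and last date. The helper below is a hypothetical sketch (the name has_no_gaps and its parameters are not from the original answer), shown on a small frame built only for the check.

```python
import pandas as pd

def has_no_gaps(frame, group_cols=("Item", "Category"), date_col="Date"):
    """Return True if every group has exactly one row per calendar day
    between its first and last date."""
    for _, group in frame.groupby(list(group_cols)):
        dates = group[date_col].sort_values()
        expected = pd.date_range(dates.min(), dates.max(), freq="D")
        # Length check first, so the element-wise comparison is always aligned
        if len(dates) != len(expected) or not (dates.values == expected.values).all():
            return False
    return True

# Illustrative input: gouda is missing 2021-01-03
gapped = pd.DataFrame({
    "Date": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-04", "2021-01-05"]),
    "Item": ["gouda"] * 4,
    "Category": ["cheese"] * 4,
})

# The same series with every day present
complete = pd.DataFrame({
    "Date": pd.date_range("2021-01-01", "2021-01-05", freq="D"),
    "Item": ["gouda"] * 5,
    "Category": ["cheese"] * 5,
})

gapped_ok = has_no_gaps(gapped)
complete_ok = has_no_gaps(complete)
```

The check also returns False when a date is duplicated within a series, since the row count then no longer matches the expected range.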

Answer 4

One option is the complete function from pyjanitor, which explicitly generates the missing rows:

# pip install pyjanitor
import pandas as pd
import janitor

df.complete(
    {'Date': lambda date: pd.date_range(date.min(), date.max())}, 
    by = ['Item', 'Category'], 
    sort = True)
 
        Date     Item Category
0 2021-01-01    gouda   cheese
1 2021-01-02    gouda   cheese
2 2021-01-03    gouda   cheese
3 2021-01-04    gouda   cheese
4 2021-01-05    gouda   cheese
5 2021-02-01  lettuce  produce
6 2021-02-02  lettuce  produce
7 2021-02-03  lettuce  produce
8 2021-02-04  lettuce  produce
9 2021-02-05  lettuce  produce

The dictionary introduces new values into the dataframe. The key of the dictionary should be an existing column; the lambda function receives the Date column.


Answer 1
Score: 3 (accepted), 2022-04-06 18:17:22

Answer 2
Score: 2, 2022-04-06 18:18:11

Answer 3
Score: 0, 2022-04-06 18:17:39

Answer 4
Score: 0, 2022-04-06 21:30:06
