Python pandas: insert rows for missing dates, time series in groupby dataframe

I have a dataframe df:

   Serial_no       date  Index     x    y
           1 2014-01-01      1   2.0  3.0
           1 2014-03-01      2   3.0  3.0
           1 2014-04-01      3   6.0  2.0
           2 2011-03-01      1   5.1  1.3
           2 2011-04-01      2   5.8  0.6
           2 2011-05-01      3   6.5 -0.1
           2 2011-07-01      4   3.0  5.0
           3 2019-10-01      1   7.9 -1.5
           3 2019-11-01      2   8.6 -2.2
           3 2020-01-01      3  10.0 -3.6
           3 2020-02-01      4  10.7 -4.3
           3 2020-03-01      5   4.0  3.0

Note: the data is grouped by Serial_no, and date is reported monthly (on the first of every month). The Index column numbers the reported dates consecutively within each group. The number of reported dates differs between Serial_no groups, and so does the date range (the groups don't start or end on the same date).

The problem: some dates in the time series have no reported data, so each Serial_no group is missing some months. I want to add a row to each group for every missing date, with NaN in the x and y columns.

Example of the dataframe I need:

   Serial_no       date  Index       x       y
           1 2014-01-01      1     2.0     3.0
           1 2014-02-01      2     NaN     NaN
           1 2014-03-01      3     3.0     3.0
           1 2014-04-01      4     6.0     2.0
           2 2011-03-01      1     5.1     1.3
           2 2011-04-01      2     5.8     0.6
           2 2011-05-01      3     6.5    -0.1
           2 2011-06-01      4     NaN     NaN
           2 2011-07-01      5     3.0     5.0
           3 2019-10-01      1     7.9    -1.5
           3 2019-11-01      2     8.6    -2.2
           3 2019-12-01      3     NaN     NaN
           3 2020-01-01      4    10.0    -3.6
           3 2020-02-01      5    10.7    -4.3
           3 2020-03-01      6     4.0     3.0

I know how to replace blank cells with NaN once the rows for the missing dates are inserted, using the following code:

import pandas as pd
import numpy as np

# Assign the result back rather than calling replace(inplace=True) on a column selection
df['x'] = df['x'].replace('', np.nan)
df['y'] = df['y'].replace('', np.nan)

I also know how to reset the Index column once the rows for the missing dates are inserted, using the following code:

# cumcount() takes no column name; it counts from 0, so add 1 for a 1-based Index
df["Index"] = df.groupby("Serial_no").cumcount() + 1
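
GroupBy.cumcount numbers the rows of each group starting from 0, and its only parameter is ascending, so a 1-based counter needs + 1. A minimal sketch on a made-up frame named demo (not the data above):

```python
import pandas as pd

# Hypothetical toy frame: two rows for group 1, three rows for group 2
demo = pd.DataFrame({"Serial_no": [1, 1, 2, 2, 2]})

# cumcount() numbers rows 0, 1, 2, ... within each group; + 1 makes it 1-based
demo["Index"] = demo.groupby("Serial_no").cumcount() + 1

print(demo["Index"].tolist())  # [1, 2, 1, 2, 3]
```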

However, I'm unsure how to locate the missing dates in each group and insert rows for those (monthly reported) dates. Any help is appreciated.

Use a custom function with DataFrame.asfreq in GroupBy.apply, then reassign Index with GroupBy.cumcount:

df['date'] = pd.to_datetime(df['date'])

df = (df.set_index('date')
        .groupby('Serial_no')
        .apply(lambda x: x.asfreq('MS'))
        .drop('Serial_no', axis=1))
df = df.reset_index()
df["Index"] = df.groupby("Serial_no").cumcount() + 1
print(df)
    Serial_no       date  Index     x    y
0           1 2014-01-01      1   2.0  3.0
1           1 2014-02-01      2   NaN  NaN
2           1 2014-03-01      3   3.0  3.0
3           1 2014-04-01      4   6.0  2.0
4           2 2011-03-01      1   5.1  1.3
5           2 2011-04-01      2   5.8  0.6
6           2 2011-05-01      3   6.5 -0.1
7           2 2011-06-01      4   NaN  NaN
8           2 2011-07-01      5   3.0  5.0
9           3 2019-10-01      1   7.9 -1.5
10          3 2019-11-01      2   8.6 -2.2
11          3 2019-12-01      3   NaN  NaN
12          3 2020-01-01      4  10.0 -3.6
13          3 2020-02-01      5  10.7 -4.3
14          3 2020-03-01      6   4.0  3.0
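
To see what the lambda does inside a single group, here is a minimal sketch on the Serial_no 1 rows alone. The names g, filled, and missing are illustrative and not part of the answer above. asfreq('MS') rebuilds the index at month-start frequency between the first and last date, and Index.difference lists the dates it had to add, which is one way to locate the gaps.

```python
import pandas as pd

# Rows for Serial_no 1 from the question, indexed by date
g = pd.DataFrame(
    {"x": [2.0, 3.0, 6.0], "y": [3.0, 3.0, 2.0]},
    index=pd.to_datetime(["2014-01-01", "2014-03-01", "2014-04-01"]),
)
g.index.name = "date"

# Conform to month-start frequency; months with no data become NaN rows
filled = g.asfreq("MS")

# Dates present after asfreq but absent before, i.e. the gaps
missing = filled.index.difference(g.index)

print(filled)
print(list(missing))
```

Note that asfreq only fills between the first and last date of the frame it is called on, which is why it is applied per group rather than to the whole dataframe.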

Alternative solution with DataFrame.reindex:

df['date'] = pd.to_datetime(df['date'])

f = lambda x: x.reindex(pd.date_range(x.index.min(), x.index.max(), freq='MS', name='date'))
df = df.set_index('date').groupby('Serial_no').apply(f).drop('Serial_no', axis=1)
df = df.reset_index()
df["Index"] = df.groupby("Serial_no").cumcount() + 1
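
How GroupBy.apply treats the grouping column has changed between pandas releases, so the following is a hedged sketch of the same reindex idea written as an explicit loop over the groups. The sample data is copied from the question; the function name fill_missing_months and the variables inside it are made up for illustration.

```python
import pandas as pd

# Sample data from the question (Index is omitted because it is recomputed below)
df = pd.DataFrame({
    "Serial_no": [1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3],
    "date": pd.to_datetime([
        "2014-01-01", "2014-03-01", "2014-04-01",
        "2011-03-01", "2011-04-01", "2011-05-01", "2011-07-01",
        "2019-10-01", "2019-11-01", "2020-01-01", "2020-02-01", "2020-03-01",
    ]),
    "x": [2.0, 3.0, 6.0, 5.1, 5.8, 6.5, 3.0, 7.9, 8.6, 10.0, 10.7, 4.0],
    "y": [3.0, 3.0, 2.0, 1.3, 0.6, -0.1, 5.0, -1.5, -2.2, -3.6, -4.3, 3.0],
})


def fill_missing_months(frame):
    """Build a new frame with one row per month-start date in each Serial_no group."""
    pieces = []
    for serial, group in frame.groupby("Serial_no"):
        # Every month start between the group's first and last reported date
        full = pd.date_range(group["date"].min(), group["date"].max(),
                             freq="MS", name="date")
        # Reindex the value columns; dates without data get NaN
        piece = group.set_index("date")[["x", "y"]].reindex(full).reset_index()
        piece.insert(0, "Serial_no", serial)
        pieces.append(piece)
    out = pd.concat(pieces, ignore_index=True)
    # Renumber rows within each group, starting at 1
    out.insert(2, "Index", out.groupby("Serial_no").cumcount() + 1)
    return out


result = fill_missing_months(df)
print(result)
```

The loop is more verbose than the one-liner, but each step can be inspected on its own, and it does not depend on whether apply passes the Serial_no column to the function.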

One option is complete from pyjanitor, which abstracts the process of exposing missing rows:

# pip install pyjanitor
import pandas as pd
import janitor

# create a mapping that is applied across each Serial_no group
new_dates = {'date': lambda d: pd.date_range(d.min(), d.max(), freq='MS')}

(df
.complete(new_dates, by='Serial_no')
.assign(Index = lambda df: df.groupby('Serial_no')
                             .Index
                             .cumcount()
                             .add(1))
)
    Serial_no       date  Index     x    y
0           1 2014-01-01      1   2.0  3.0
1           1 2014-02-01      2   NaN  NaN
2           1 2014-03-01      3   3.0  3.0
3           1 2014-04-01      4   6.0  2.0
4           2 2011-03-01      1   5.1  1.3
5           2 2011-04-01      2   5.8  0.6
6           2 2011-05-01      3   6.5 -0.1
7           2 2011-06-01      4   NaN  NaN
8           2 2011-07-01      5   3.0  5.0
9           3 2019-10-01      1   7.9 -1.5
10          3 2019-11-01      2   8.6 -2.2
11          3 2019-12-01      3   NaN  NaN
12          3 2020-01-01      4  10.0 -3.6
13          3 2020-02-01      5  10.7 -4.3
14          3 2020-03-01      6   4.0  3.0
