Python pandas: insert rows for missing dates, time series in groupby dataframe
I have a dataframe df:
Serial_no date Index x y
1 2014-01-01 1 2.0 3.0
1 2014-03-01 2 3.0 3.0
1 2014-04-01 3 6.0 2.0
2 2011-03-01 1 5.1 1.3
2 2011-04-01 2 5.8 0.6
2 2011-05-01 3 6.5 -0.1
2 2011-07-01 4 3.0 5.0
3 2019-10-01 1 7.9 -1.5
3 2019-11-01 2 8.6 -2.2
3 2020-01-01 3 10.0 -3.6
3 2020-02-01 4 10.7 -4.3
3 2020-03-01 5 4.0 3.0
Notice: The data is grouped by Serial_no, and date is reported monthly (on the first of every month). The Index column numbers the reported dates consecutively within each group. The number of reported dates differs between Serial_no groups, and so does the date range covered (the groups don't start or end on the same date).
The problem: Some dates in the time series have no reported data. Notice that each Serial_no group is missing some dates. I want to add a row to each group for each missing date, with the x and y columns set to NaN.
Example of the dataframe I need:
Serial_no date Index x y
1 2014-01-01 1 2.0 3.0
1 2014-02-01 2 NaN NaN
1 2014-03-01 3 3.0 3.0
1 2014-04-01 4 6.0 2.0
2 2011-03-01 1 5.1 1.3
2 2011-04-01 2 5.8 0.6
2 2011-05-01 3 6.5 -0.1
2 2011-06-01 4 NaN NaN
2 2011-07-01 5 3.0 5.0
3 2019-10-01 1 7.9 -1.5
3 2019-11-01 2 8.6 -2.2
3 2019-12-01 3 NaN NaN
3 2020-01-01 4 10.0 -3.6
3 2020-02-01 5 10.7 -4.3
3 2020-03-01 6 4.0 3.0
I know how to replace the blank cells with NaN once the rows with missing dates are inserted, using the following code:
import pandas as pd
import numpy as np
# assign the result back instead of calling replace(..., inplace=True) on a column selection
df['x'] = df['x'].replace('', np.nan)
df['y'] = df['y'].replace('', np.nan)
I also know how to reset the index once the rows with missing dates are inserted, using the following code:
# cumcount() numbers rows from 0 within each group, so add 1 to start at 1
df["Index"] = df.groupby("Serial_no").cumcount() + 1
However, I'm unsure how to locate the missing dates in each group and insert a row for each of those (monthly reported) dates. Any help is appreciated.
Use a custom function with DataFrame.asfreq in GroupBy.apply, then reassign Index with GroupBy.cumcount:
df['date'] = pd.to_datetime(df['date'])
df = (df.set_index('date')
        .groupby('Serial_no')
        .apply(lambda x: x.asfreq('MS'))
        .drop('Serial_no', axis=1))
df = df.reset_index()
df["Index"] = df.groupby("Serial_no").cumcount() + 1
print(df)
Serial_no date Index x y
0 1 2014-01-01 1 2.0 3.0
1 1 2014-02-01 2 NaN NaN
2 1 2014-03-01 3 3.0 3.0
3 1 2014-04-01 4 6.0 2.0
4 2 2011-03-01 1 5.1 1.3
5 2 2011-04-01 2 5.8 0.6
6 2 2011-05-01 3 6.5 -0.1
7 2 2011-06-01 4 NaN NaN
8 2 2011-07-01 5 3.0 5.0
9 3 2019-10-01 1 7.9 -1.5
10 3 2019-11-01 2 8.6 -2.2
11 3 2019-12-01 3 NaN NaN
12 3 2020-01-01 4 10.0 -3.6
13 3 2020-02-01 5 10.7 -4.3
14 3 2020-03-01 6 4.0 3.0
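For anyone who wants to run this end to end, here is a self-contained sketch of the same asfreq idea, using the values from the question's table. It differs from the snippet above in one way that is my own choice: it selects the value columns (x and y) before calling apply instead of dropping Serial_no afterwards. As far as I know, whether the grouping column is passed to the applied function has changed between pandas releases, and selecting the columns first avoids depending on that.

```python
import pandas as pd

# Sample data copied from the question's table.
df = pd.DataFrame({
    "Serial_no": [1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3],
    "date": ["2014-01-01", "2014-03-01", "2014-04-01",
             "2011-03-01", "2011-04-01", "2011-05-01", "2011-07-01",
             "2019-10-01", "2019-11-01", "2020-01-01", "2020-02-01", "2020-03-01"],
    "x": [2.0, 3.0, 6.0, 5.1, 5.8, 6.5, 3.0, 7.9, 8.6, 10.0, 10.7, 4.0],
    "y": [3.0, 3.0, 2.0, 1.3, 0.6, -0.1, 5.0, -1.5, -2.2, -3.6, -4.3, 3.0],
})
df["date"] = pd.to_datetime(df["date"])

# Within each Serial_no group, asfreq('MS') fills in every month start
# between the group's first and last date; new rows get NaN in x and y.
out = (df.set_index("date")
         .groupby("Serial_no")[["x", "y"]]
         .apply(lambda g: g.asfreq("MS"))
         .reset_index())

# Renumber the rows within each group, starting at 1.
out["Index"] = out.groupby("Serial_no").cumcount() + 1
print(out)
```

In this sketch the Index column ends up last rather than right after date; reorder with out[["Serial_no", "date", "Index", "x", "y"]] if the original column order matters.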
Alternative solution with DataFrame.reindex:
df['date'] = pd.to_datetime(df['date'])
f = lambda x: x.reindex(pd.date_range(x.index.min(), x.index.max(), freq='MS', name='date'))
df = df.set_index('date').groupby('Serial_no').apply(f).drop('Serial_no', axis=1)
df = df.reset_index()
df["Index"] = df.groupby("Serial_no").cumcount() + 1
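The reindex idea can also be written without GroupBy.apply, by building the complete (Serial_no, date) index up front and reindexing once. This is only a sketch of that variant, not the code from the answer above; the names full_index, indexed and missing are made up for illustration, and the data is a subset of the question's table. A side effect is that the missing dates themselves are available, which is what the question asked how to locate.

```python
import pandas as pd

# Subset of the question's data (Serial_no 1 and 2 only), to keep the sketch short.
df = pd.DataFrame({
    "Serial_no": [1, 1, 1, 2, 2, 2, 2],
    "date": pd.to_datetime(["2014-01-01", "2014-03-01", "2014-04-01",
                            "2011-03-01", "2011-04-01", "2011-05-01", "2011-07-01"]),
    "x": [2.0, 3.0, 6.0, 5.1, 5.8, 6.5, 3.0],
    "y": [3.0, 3.0, 2.0, 1.3, 0.6, -0.1, 5.0],
})

# Every (Serial_no, month start) pair between each group's first and last date.
full_index = pd.MultiIndex.from_tuples(
    [(serial, day)
     for serial, dates in df.groupby("Serial_no")["date"]
     for day in pd.date_range(dates.min(), dates.max(), freq="MS")],
    names=["Serial_no", "date"],
)

indexed = df.set_index(["Serial_no", "date"])

# The pairs that are expected but not present are the missing dates.
missing = full_index.difference(indexed.index)
print(missing.tolist())

# Reindexing inserts those pairs as new rows with NaN in x and y.
out = indexed.reindex(full_index).reset_index()
out["Index"] = out.groupby("Serial_no").cumcount() + 1
print(out)
```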
One option is complete from pyjanitor, which abstracts the process of exposing missing rows:
# pip install pyjanitor
import pandas as pd
import janitor
# create a mapping that is applied across each Serial_no group
new_dates = {'date': lambda d: pd.date_range(d.min(), d.max(), freq='MS')}
(df
 .complete(new_dates, by='Serial_no')
 .assign(Index = lambda df: df.groupby('Serial_no')
                              .Index
                              .cumcount()
                              .add(1))
)
Serial_no date Index x y
0 1 2014-01-01 1 2.0 3.0
1 1 2014-02-01 2 NaN NaN
2 1 2014-03-01 3 3.0 3.0
3 1 2014-04-01 4 6.0 2.0
4 2 2011-03-01 1 5.1 1.3
5 2 2011-04-01 2 5.8 0.6
6 2 2011-05-01 3 6.5 -0.1
7 2 2011-06-01 4 NaN NaN
8 2 2011-07-01 5 3.0 5.0
9 3 2019-10-01 1 7.9 -1.5
10 3 2019-11-01 2 8.6 -2.2
11 3 2019-12-01 3 NaN NaN
12 3 2020-01-01 4 10.0 -3.6
13 3 2020-02-01 5 10.7 -4.3
14 3 2020-03-01 6 4.0 3.0