Pandas fill in missing monthly dates in DataFrame, fill up one specific column with zeros
I am facing an issue with Pandas: how to fill in missing dates in a DataFrame. The structure of the given DataFrame is as follows:
Amount Code Type Date
0 34.97 J36J 74343 2016-01-01
1 16.32 J36J 74343 2016-04-01
2 10.30 J36J 69927 2015-12-01
3 10.45 J36J 69927 2016-07-01
4 5.63 J36J 69927 2017-03-01
5 15.79 J36J 69927 2018-09-01
6 15.00 J36J 69927 2019-06-01
7 6.44 J36J 69926 2016-03-01
8 6.47 J36J 69926 2017-03-01
9 15.00 J36J 69926 2018-07-01
10 15.00 J36J 69926 2019-06-01
My goal is to have a monthly entry for every Type covering this timespan. That is, every Material (Type) should have 58 entries. The 'artificially' created monthly entries should have an Amount of 0. So my expected output would be (just for one Type, as an example):
Amount Code Type Date
0 34.97 J36J 74343 2016-01-01
1 0 J36J 74343 2016-02-01
2 0 J36J 74343 2016-03-01
3 16.32 J36J 74343 2016-04-01
4 0 J36J 74343 2016-05-01
5 0 J36J 74343 2016-06-01
6 0 J36J 74343 2016-07-01
7 0 J36J 74343 2016-08-01
8 0 J36J 74343 2016-09-01
9 0 J36J 74343 2016-10-01
10 0 J36J 74343 2016-11-01
11 0 J36J 74343 2016-12-01
Fortunately, somebody already asked the same question (Pandas fill in missing dates in DataFrame with multiple columns).
I adapted the quite helpful answer to my case:
df.Date=pd.to_datetime(df.Date)
s=pd.date_range(df.Date.min(),df.Date.max(),freq='MS')
df=df.set_index(['Code','Type','Date']).\
Amount.unstack().reindex(columns=s,fill_value=0).stack().reset_index()
df
This worked quite well, but when I checked the resulting DataFrame afterwards, some of the dates turned out to be missing:
398 74343 J36J 2016-01-01 34.97
399 74343 J36J 2016-02-01 0.00
400 74343 J36J 2016-04-01 16.32
401 74343 J36J 2016-05-01 0.00
402 74343 J36J 2016-06-01 0.00
403 74343 J36J 2016-08-01 0.00
404 74343 J36J 2016-10-01 0.00
405 74343 J36J 2016-11-01 0.00
406 74343 J36J 2016-12-01 0.00
Do any of you know what could be the reason for this? I'm assuming it may be because of the frequency ('MS') I've chosen, but I cannot see how any of the others would fit ( https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html ). Or do I have to set the date range manually? In my initial DataFrame, obviously not all dates are available.
Any help on that matter is appreciated.
BR
This was a subtle one, lots of fun.
import pandas as pd
# Rebuild the sample data from the question
data = {'Amount': [34.97, 16.32, 10.3, 10.45, 5.63, 15.79, 15, 6.44, 6.47, 15, 15],
        'Code': ['J36J','J36J','J36J','J36J','J36J','J36J','J36J','J36J','J36J','J36J','J36J'],
        'Type': [74343,74343,69927,69927,69927,69927,69927,69926,69926,69926,69926],
        'Date': ['1/1/2016','4/1/2016','12/1/2015','7/1/2016','3/1/2017','9/1/2018','6/1/2019','3/1/2016','3/1/2017','7/1/2018','6/1/2019']}
df = pd.DataFrame(data)
# Dates are written month/day/year
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')
df
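This reproduces the starting values shown above. Working out what happened took a while: the problem was that the same s was applied to all of the types together instead of to each type individually. So if a date existed for another type, it was not overwritten with 0 for this type. In more detail, unstack() creates one column per date that occurs anywhere in the data and puts NaN where a type has no entry for that date, reindex(..., fill_value=0) fills only the columns it newly adds, and stack() then dropped the remaining NaN entries (its default behaviour in the pandas version used here).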
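The effect described above can be reproduced on a tiny frame. This is only a sketch: the four rows and the names `wide`, `reindexed` and `full` are made up for illustration, and calling `fillna(0)` before `stack()` is one possible fix, not necessarily the one intended by the answer.

```python
import pandas as pd

# Made-up sample: each Type is missing a month that the other Type has.
df = pd.DataFrame({
    "Code": ["J36J"] * 4,
    "Type": [74343, 74343, 69926, 69926],
    "Date": pd.to_datetime(["2016-01-01", "2016-04-01",
                            "2016-03-01", "2016-04-01"]),
    "Amount": [34.97, 16.32, 6.44, 6.47],
})
months = pd.date_range(df["Date"].min(), df["Date"].max(), freq="MS")

# One row per (Code, Type), one column per date seen anywhere in the data.
wide = df.set_index(["Code", "Type", "Date"])["Amount"].unstack()

# fill_value=0 only applies to the newly added column (2016-02);
# the NaN cells that unstack() created are left untouched.
reindexed = wide.reindex(columns=months, fill_value=0)
print(reindexed.isna().sum().sum())  # number of cells still NaN

# Possible fix: fill the remaining NaN cells explicitly before stacking.
full = (reindexed.fillna(0)
        .rename_axis(columns="Date")   # name the date level explicitly
        .stack()
        .rename("Amount")
        .reset_index())
print(full)
```

In pandas versions where `stack()` drops NaN by default, those leftover NaN cells are exactly the rows that vanish from the result.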
To solve this, I did it in pieces so that we could build it back together.
s = pd.date_range(df.Date.min(), df.Date.max(), freq='MS')
pieces = []
for name, subdf in df.groupby('Type'):
    # Reshape one Type at a time, so dates from other Types cannot create NaN columns
    thisdf = subdf.set_index(['Code','Type','Date']).\
        Amount.unstack().reindex(columns=s,fill_value=0).stack().reset_index()
    # After reset_index the date level is called "level_2" and the values column 0
    thisdf = thisdf.rename(columns={0: "Amount", "level_2": "Date"}, errors="raise")
    pieces.append(thisdf[['Code', 'Type', 'Date', 'Amount']])
# Glue the per-Type pieces back together with a fresh 0..n-1 index
outdf = pd.concat(pieces, ignore_index=True)
So what we did was break the data into one piece per Type and glue the pieces back together after each pass through the groupby. That way, dates that exist only for other types can no longer cause missing dates for this type.
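As an alternative to looping over the groups, the full grid of (Code, Type) pairs and months can be built first and the original amounts merged onto it. This is a sketch rather than part of the answer: it assumes each (Code, Type, Date) combination occurs at most once and a pandas version that supports `merge(..., how="cross")` (1.2 or later); the names `months`, `pairs`, `grid` and `out` are illustrative.

```python
import pandas as pd

data = {
    "Amount": [34.97, 16.32, 10.3, 10.45, 5.63, 15.79, 15, 6.44, 6.47, 15, 15],
    "Code": ["J36J"] * 11,
    "Type": [74343, 74343, 69927, 69927, 69927, 69927, 69927,
             69926, 69926, 69926, 69926],
    "Date": ["1/1/2016", "4/1/2016", "12/1/2015", "7/1/2016", "3/1/2017",
             "9/1/2018", "6/1/2019", "3/1/2016", "3/1/2017", "7/1/2018",
             "6/1/2019"],
}
df = pd.DataFrame(data)
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")

# Every month start between the earliest and the latest date in the data.
months = pd.date_range(df["Date"].min(), df["Date"].max(), freq="MS")

# Every (Code, Type) pair that actually occurs.
pairs = df[["Code", "Type"]].drop_duplicates()

# Cartesian product: one row per pair and month.
grid = pairs.merge(pd.DataFrame({"Date": months}), how="cross")

# Attach the known amounts; months without an entry get 0.
out = grid.merge(df, on=["Code", "Type", "Date"], how="left")
out["Amount"] = out["Amount"].fillna(0)
out = out[["Code", "Type", "Date", "Amount"]]
print(out.groupby("Type").size())
```

Because the grid is built explicitly, the result does not depend on how `stack()` treats NaN values.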