Pandas fill in missing monthly dates in DataFrame, fill up one specific column with zeros

I am facing an issue with Pandas: how to fill in missing dates in a DataFrame. The structure of the given DataFrame is as follows:

     Amount  Code     Type   Date
0     34.97  J36J     74343 2016-01-01
1     16.32  J36J     74343 2016-04-01
2     10.30  J36J     69927 2015-12-01
3     10.45  J36J     69927 2016-07-01
4      5.63  J36J     69927 2017-03-01
5     15.79  J36J     69927 2018-09-01
6     15.00  J36J     69927 2019-06-01
7      6.44  J36J     69926 2016-03-01
8      6.47  J36J     69926 2017-03-01
9     15.00  J36J     69926 2018-07-01
10    15.00  J36J     69926 2019-06-01
  • Amount: the amount
  • Code: the product code, which is the same throughout the entire DataFrame
  • Type: a product type; there are many different ones
  • Date: a date range spanning December 2015 to September 2020

My goal is to have a monthly entry for every Type covering this timespan, meaning every Type should have 58 entries. The artificially created monthly entries should have an Amount of 0. So my expected output would be (shown for just one Type, as an example):

     Amount  Code     Type   Date
0     34.97  J36J     74343 2016-01-01
1     0      J36J     74343 2016-02-01
2     0      J36J     74343 2016-03-01
3     16.32  J36J     74343 2016-04-01
4     0      J36J     74343 2016-05-01
5     0      J36J     74343 2016-06-01
6     0      J36J     74343 2016-07-01
7     0      J36J     74343 2016-08-01
8     0      J36J     74343 2016-09-01
9     0      J36J     74343 2016-10-01
10    0      J36J     74343 2016-11-01
11    0      J36J     74343 2016-12-01

Fortunately, somebody already had the same question (Pandas fill in missing dates in DataFrame with multiple columns).

I adapted the quite helpful answer to my case:

df.Date=pd.to_datetime(df.Date)
s=pd.date_range(df.Date.min(),df.Date.max(),freq='MS')

df=df.set_index(['Code','Type','Date']).\
      Amount.unstack().reindex(columns=s,fill_value=0).stack().reset_index()
df

This worked quite well, but when I checked the resulting DataFrame afterwards, some of the dates were missing:

398     74343  J36J 2016-01-01  34.97
399     74343  J36J 2016-02-01   0.00
400     74343  J36J 2016-04-01  16.32
401     74343  J36J 2016-05-01   0.00
402     74343  J36J 2016-06-01   0.00
403     74343  J36J 2016-08-01   0.00
404     74343  J36J 2016-10-01   0.00
405     74343  J36J 2016-11-01   0.00
406     74343  J36J 2016-12-01   0.00

Do any of you know what could be the reason for this? I'm assuming it may be because of the frequency ('MS') I've chosen, but I cannot think of any other that would fit (https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html). Or do I have to set the date range manually? In my initial DataFrame, obviously not all dates are available.

Any help on that matter is appreciated.

BR

This was a subtle one, lots of fun.

import pandas as pd
# Rebuild the sample data from the question.
data = {'Amount' :[34.97, 16.32, 10.3, 10.45, 5.63, 15.79, 15, 6.44, 6.47, 15, 15],
'Code': ['J36J','J36J','J36J','J36J','J36J','J36J','J36J','J36J','J36J','J36J','J36J'],
'Type': [74343,74343,69927,69927,69927,69927,69927,69926,69926,69926,69926],
'Date': ['1/1/2016','4/1/2016','12/1/2015','7/1/2016','3/1/2017','9/1/2018','6/1/2019','3/1/2016','3/1/2017','7/1/2018','6/1/2019']}
df = pd.DataFrame(data)
# Parse the month/day/year strings into real datetimes.
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')
df

This reproduces the starting values shown above. Working out what happened took a while: the problem was that we applied the same s, and the same unstack, to all Types at once instead of to each Type individually. A date that already existed for another Type was therefore already a column after unstack, holding NaN for this Type, so reindex did not overwrite it with the fill value and stack() then dropped it.

To solve this, I did it in pieces so that we could build the result back together.
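To make the cause concrete, here is a minimal sketch on made-up toy data (the Types "A" and "B" and their amounts are hypothetical, not taken from the question). It shows that unstack leaves NaN in cells where one Type lacks a date that another Type has, and that reindex(..., fill_value=0) only fills the columns it newly adds. In pandas versions where stack() drops NaN by default, those leftover NaN cells are the rows that vanish. Passing fill_value=0 to unstack is one possible way to avoid this without a loop.

```python
import pandas as pd

# Hypothetical toy data: Type A has Jan and Apr, Type B has only Mar.
df = pd.DataFrame({
    "Code": ["X", "X", "X"],
    "Type": ["A", "A", "B"],
    "Date": pd.to_datetime(["2016-01-01", "2016-04-01", "2016-03-01"]),
    "Amount": [1.0, 2.0, 3.0],
})
s = pd.date_range(df.Date.min(), df.Date.max(), freq="MS")  # Jan, Feb, Mar, Apr

# Columns become the dates seen in ANY Type: Jan, Mar, Apr.
# A has no Mar value and B has no Jan/Apr value, so those cells are NaN.
wide = df.set_index(["Code", "Type", "Date"]).Amount.unstack()

# fill_value only applies to the newly added column (Feb);
# the NaN cells created by unstack are left untouched.
filled = wide.reindex(columns=s, fill_value=0)
nan_cells = int(filled.isna().sum().sum())
print(filled)
print("NaN cells left after reindex:", nan_cells)

# One possible fix: fill at unstack time, so nothing is left to be dropped.
fixed = (df.set_index(["Code", "Type", "Date"]).Amount
           .unstack(fill_value=0)
           .reindex(columns=s, fill_value=0))
print(fixed)
```

Here `fixed` has one row per (Code, Type) pair and one column per month, with no NaN cells, so stacking it back to long form keeps every month.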

s = pd.date_range(df.Date.min(), df.Date.max(), freq='MS')
pieces = []
for name, subdf in df.groupby('Type'):
    # One Type at a time: unstack yields a single row with no NaN cells,
    # so reindex fills every missing month with 0 and stack drops nothing.
    thisdf = subdf.set_index(['Code', 'Type', 'Date']).\
        Amount.unstack().reindex(columns=s, fill_value=0).stack().reset_index()
    thisdf.columns = ['Code', 'Type', 'Date', 'Amount']
    pieces.append(thisdf)

# Glue the per-Type pieces back together with a fresh 0..n-1 index.
outdf = pd.concat(pieces, ignore_index=True)

So we split the data into one piece per Type, filled in the months for each piece, and glued the pieces back together. Because each piece contains only one Type, dates that exist only for other Types can no longer leave gaps in it.
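As an alternative to the loop, the same idea can be sketched by building the complete (Code, Type, Date) index up front and reindexing against it. This is not the approach from the answer above, just one possible variant; it assumes each (Code, Type, Date) combination occurs at most once, and it uses only the first four rows of the question's sample to keep the output short.

```python
import pandas as pd

# First four rows of the sample data from the question.
df = pd.DataFrame({
    "Amount": [34.97, 16.32, 10.30, 10.45],
    "Code": ["J36J"] * 4,
    "Type": [74343, 74343, 69927, 69927],
    "Date": pd.to_datetime(["2016-01-01", "2016-04-01", "2015-12-01", "2016-07-01"]),
})

# Every month start between the earliest and latest date in the data.
months = pd.date_range(df["Date"].min(), df["Date"].max(), freq="MS")

# Every (Code, Type) pair that occurs in the data, crossed with every month.
pairs = df[["Code", "Type"]].drop_duplicates()
full_index = pd.MultiIndex.from_tuples(
    [(code, typ, month)
     for code, typ in pairs.itertuples(index=False)
     for month in months],
    names=["Code", "Type", "Date"],
)

# Reindexing adds the missing (Code, Type, Date) rows with Amount 0
# and keeps the existing amounts where they were.
out = (df.set_index(["Code", "Type", "Date"])["Amount"]
         .reindex(full_index, fill_value=0)
         .reset_index())
print(out)
```

Since no wide table is built, there is no stack step that could drop rows, and the result already has the columns Code, Type, Date and Amount.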
