
Pandas Reindex to Fill Missing Dates, or Better Method to Fill?

My data is absence records from a factory. Some days there are no absences, so no data or date is recorded for that day. However, and this is where it gets hairy compared with the other examples I've seen, on any given day there can be several absences for various reasons. There is not always a one-to-one ratio of dates to records in the data.

The result I'm hoping for is something like this:

(index)    Shift        Description     Instances (SUM)
01-01-14   2nd Baker    Discipline      0
01-01-14   2nd Baker    Vacation        0
01-01-14   1st Cooks    Discipline      0
01-01-14   1st Cooks    Vacation        0
01-02-14   2nd Baker    Discipline      4
01-02-14   2nd Baker    Vacation        3
01-02-14   1st Cooks    Discipline      3
01-02-14   1st Cooks    Vacation        3

And so on. The idea is that every shift and description will have a value for every day in the time period (in this example, 1/1/2014 to 12/31/2014).

I've read several examples, and the closest I've come to getting this working is here.

ts = pd.read_csv('Absentee_Data_2.csv',
                 encoding='utf-8',
                 parse_dates=[3],
                 index_col=3,
                 dayfirst=True)

idx =  pd.date_range('01.01.2009', '12.31.2017')

ts.index = pd.DatetimeIndex(ts.index)
# ts = ts.reindex(idx, fill_value='NaN')
df = pd.DataFrame(index = idx)
df1 = df.join(ts, how='left')

But when I uncomment the ts = ts.reindex(idx, fill_value='NaN') line, I get error messages. I've tried at least 10 other ways to accomplish what I'm trying to do, so I'm not 100% sure this is the right path, but it seems to have gotten me closest to any kind of progress.

Here's some sample data:

Description Unexcused   Instances   Date        Shift
Discipline  FALSE              1    Jan 2 2014  2nd Baker
Vacation    TRUE               2    Jan 2 2014  1st Cooks
Discipline  FALSE              3    Jan 2 2014  2nd Baker
Vacation    TRUE               1    Jan 2 2014  1st Cooks
Discipline  FALSE              2    Apr 8 2014  2nd Baker
Vacation    TRUE               3    Apr 8 2014  1st Cooks
Discipline  FALSE              1    Jun 1 2014  2nd Baker
Vacation    TRUE               2    Jun 1 2014  1st Cooks
Discipline  FALSE              3    Jun 1 2014  2nd Baker
Vacation    TRUE               1    Jun 1 2014  1st Cooks
Vacation    TRUE               2    Jul 5 2014  1st Cooks
Discipline  FALSE              3    Jul 5 2014  2nd Baker
Vacation    TRUE               2    Dec 3 2014  1st Cooks

Thank you in advance for your help. I'm a newbie and 2 days into this without much progress. I really appreciate how people here help with answers and, most importantly, with instruction on why the solutions work. Newbies like me are very grateful for the wisdom shared.

I think you just have a problem with the use of datetime; this approach worked for me:

import pandas as pd

# Assumes ts was read without index_col, so 'Date' is still a regular column.
ts = ts.set_index('Date')
ts.index = pd.to_datetime(ts.index, format='%b %d %Y')
d2 = pd.DataFrame(index=pd.date_range('2014-01-01', '2014-12-31'))

print(ts.join(d2, how='right'))
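The question does not quote the error message, so the following is an assumption about its cause: reindex refuses to work on an index that contains duplicate labels, and the sample data has several rows per date. A minimal sketch with a hypothetical three-row sample (the names csv_text, reindex_failed and joined are illustrative only) shows the failure next to the join, which tolerates the duplicates:

```python
import io

import pandas as pd

# Hypothetical sample: two of the three rows share the same date.
csv_text = """Description,Instances,Date,Shift
Discipline,1,Jan 2 2014,2nd Baker
Vacation,2,Jan 2 2014,1st Cooks
Discipline,2,Apr 8 2014,2nd Baker
"""

ts = pd.read_csv(io.StringIO(csv_text))
ts["Date"] = pd.to_datetime(ts["Date"], format="%b %d %Y")
ts = ts.set_index("Date")

idx = pd.date_range("2014-01-01", "2014-12-31")

# reindex needs unique labels on the axis it is aligning.
try:
    ts.reindex(idx)
    reindex_failed = False
except ValueError as exc:
    reindex_failed = True
    print("reindex failed:", exc)

# A left join onto an empty frame keeps every calendar day and
# simply repeats a day once per matching record.
joined = pd.DataFrame(index=idx).join(ts, how="left")
print(len(joined))
```

Separately, fill_value='NaN' passes the three-character string 'NaN' rather than a missing value, so even on a unique index it would put text into the numeric columns.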

Actually, you were pretty close to what you wanted (assuming I correctly understood the output you're looking for). See my additions to your code above:

import pandas as pd

ts = pd.read_csv('Absentee_Data_2.csv', encoding='utf-8', parse_dates=[3],
                 index_col=3, dayfirst=True, sep=",")

idx =  pd.date_range('01.01.2009', '12.31.2017')

ts.index = pd.DatetimeIndex(ts.index)
#ts = ts.reindex(idx, fill_value='NaN')
df = pd.DataFrame(index = idx)
df1 = df.join(ts, how='left')
df2 = df1.copy()
df3 = df1.copy()
df4 = df1.copy()
dict1 = {'Description': 'Discipline', 'Instances': 0, 'Shift': '1st Cooks'}
df1 = df1.fillna(dict1)
dict1["Description"] = "Vacation"
df2 = df2.fillna(dict1)
dict1["Shift"] = "2nd Baker"
df3 = df3.fillna(dict1)
dict1["Description"] = "Discipline"
df4 = df4.fillna(dict1)
df_with_duplicates = pd.concat([df1,df2,df3,df4])
final_res = (
    df_with_duplicates.reset_index()
    .drop_duplicates(subset=["index"] + list(dict1.keys()))
    .set_index("index")
    .drop("Unexcused", axis=1)
)

Basically, what you'd add:

  • Copy the almost empty df created with ts (df1) so that there are four frames, one per Shift/Description combination
  • fillna(dict1) fills all the NaN in each column with the static value given for that column
  • Concatenate the 4 dfs; we still need to remove some duplicates, as the original values from the csv are now repeated 4 times
  • Drop the duplicates; the index has to take part in the comparison so that the added rows are kept, hence the reset_index followed by set_index("index")
  • Finally, drop the Unexcused column
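The fillna step from the list above can be looked at in isolation. This is a small sketch on a made-up two-row frame (df, defaults and filled are illustrative names, not part of the code above): passing a dict to fillna fills each column's NaN with the value stored under that column's name and leaves existing values alone.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one real record and one empty day.
df = pd.DataFrame(
    {
        "Description": ["Vacation", np.nan],
        "Instances": [2.0, np.nan],
        "Shift": ["1st Cooks", np.nan],
    },
    index=pd.to_datetime(["2014-01-02", "2014-01-03"]),
)

# One default per column; columns missing from the dict are not filled.
defaults = {"Description": "Discipline", "Instances": 0, "Shift": "2nd Baker"}

filled = df.fillna(defaults)
print(filled)
```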

Finally, some sample output:

In [5]: final_res["2013-01-2"]
Out[5]: 
           Description  Instances      Shift
index                                       
2013-01-02  Discipline        0.0  1st Cooks
2013-01-02    Vacation        0.0  1st Cooks
2013-01-02    Vacation        0.0  2nd Baker
2013-01-02  Discipline        0.0  2nd Baker

In [6]: final_res["2014-01-2"]
Out[6]: 
           Description  Instances       Shift
index                                        
2014-01-02  Discipline        1.0   2nd Baker
2014-01-02    Vacation        2.0   1st Cooks
2014-01-02  Discipline        3.0   2nd Baker
2014-01-02    Vacation        1.0   1st Cooks
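
Note that the 2014-01-02 output still lists individual records, while the question asks for one summed row per day, shift and description. What follows is an alternative sketch, not the method of either answer: sum first so each (Date, Shift, Description) combination is unique, then reindex against the full product of days, shifts and descriptions. It assumes every shift and description of interest appears at least once in the data, and the names csv_text, totals, full_index and result are illustrative.

```python
import io

import pandas as pd

# A few rows taken from the sample data in the question.
csv_text = """Description,Unexcused,Instances,Date,Shift
Discipline,FALSE,1,Jan 2 2014,2nd Baker
Vacation,TRUE,2,Jan 2 2014,1st Cooks
Discipline,FALSE,3,Jan 2 2014,2nd Baker
Vacation,TRUE,1,Jan 2 2014,1st Cooks
Discipline,FALSE,2,Apr 8 2014,2nd Baker
"""

ts = pd.read_csv(io.StringIO(csv_text))
ts["Date"] = pd.to_datetime(ts["Date"], format="%b %d %Y")

# Summing per group makes the index unique, which reindex requires.
totals = ts.groupby(["Date", "Shift", "Description"])["Instances"].sum()

# Every day of the period crossed with every shift and description.
full_index = pd.MultiIndex.from_product(
    [
        pd.date_range("2014-01-01", "2014-12-31"),
        ts["Shift"].unique(),
        ts["Description"].unique(),
    ],
    names=["Date", "Shift", "Description"],
)

# Combinations with no records get 0 instead of NaN.
result = totals.reindex(full_index, fill_value=0).reset_index()
print(result.head(8))
```

Because the fill value is the integer 0 rather than a string, the Instances column stays numeric.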
