Pandas Reindex to Fill Missing Dates, or Better Method to Fill?
My data is absence records from a factory. Some days there are no absences, so no data or date is recorded for that day. However, and this is where it gets hairy with the other examples I've seen, on any given day there can be several absences for various reasons. There is not always a one-to-one ratio of dates to records in the data.
The result I'm hoping for is something like this:
(index) Shift Description Instances (SUM)
01-01-14 2nd Baker Discipline 0
01-01-14 2nd Baker Vacation 0
01-01-14 1st Cooks Discipline 0
01-01-14 1st Cooks Vacation 0
01-02-14 2nd Baker Discipline 4
01-02-14 2nd Baker Vacation 3
01-02-14 1st Cooks Discipline 3
01-02-14 1st Cooks Vacation 3
And so on. The idea is that every shift and description will have a value for every day in the time period (in this example, 1/1/2014 to 12/31/2014).
I've read several examples, and the closest I've come to getting this working is here.
ts = pd.read_csv('Absentee_Data_2.csv'
, encoding = 'utf-8'
,parse_dates=[3]
,index_col=3
,dayfirst=True
)
idx = pd.date_range('01.01.2009', '12.31.2017')
ts.index = pd.DatetimeIndex(ts.index)
# ts = ts.reindex(idx, fill_value='NaN')
df = pd.DataFrame(index = idx)
df1 = df.join(ts, how='left')
But when I uncomment the line ts = ts.reindex(idx, fill_value='NaN'), I get error messages. I've tried at least 10 other ways to accomplish what I'm trying to do, so I'm not 100% sure this is the right path, but it has gotten me closer to progress than anything else.
Here's some sample data:
Description Unexcused Instances Date Shift
Discipline FALSE 1 Jan 2 2014 2nd Baker
Vacation TRUE 2 Jan 2 2014 1st Cooks
Discipline FALSE 3 Jan 2 2014 2nd Baker
Vacation TRUE 1 Jan 2 2014 1st Cooks
Discipline FALSE 2 Apr 8 2014 2nd Baker
Vacation TRUE 3 Apr 8 2014 1st Cooks
Discipline FALSE 1 Jun 1 2014 2nd Baker
Vacation TRUE 2 Jun 1 2014 1st Cooks
Discipline FALSE 3 Jun 1 2014 2nd Baker
Vacation TRUE 1 Jun 1 2014 1st Cooks
Vacation TRUE 2 Jul 5 2014 1st Cooks
Discipline FALSE 3 Jul 5 2014 2nd Baker
Vacation TRUE 2 Dec 3 2014 1st Cooks
Thank you in advance for your help. I'm a newbie and two days into this without much progress. I really appreciate how people here help with answers and, most importantly, with instruction on why the solutions work. Newbies like me are very grateful for the wisdom shared.
I think you just have a problem with the use of datetime; this approach worked for me:
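# Assumes ts was read without index_col, so 'Date' is still a regular column
ts.set_index(['Date'],inplace=True)
# Parse strings such as "Jan 2 2014" into real timestamps
ts.index = pd.to_datetime(ts.index,format='%b %d %Y')
# Empty frame whose index holds every day of 2014
d2 = pd.DataFrame(index=pd.date_range('2014-01-01','2014-12-31'))
# A right join keeps every date in d2, including days with no records
print(ts.join(d2,how='right'))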
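To try this approach without the original CSV, here is a minimal sketch that rebuilds a few rows of the question's sample data inline. The comma-separated layout and the `csv_text` name are assumptions made for illustration; the real file may be delimited differently.

```python
import io
import pandas as pd

# A few rows from the question's sample, re-typed as CSV (assumed layout).
csv_text = """Description,Unexcused,Instances,Date,Shift
Discipline,False,1,Jan 2 2014,2nd Baker
Vacation,True,2,Jan 2 2014,1st Cooks
Discipline,False,2,Apr 8 2014,2nd Baker
Vacation,True,2,Dec 3 2014,1st Cooks
"""

ts = pd.read_csv(io.StringIO(csv_text))
ts = ts.set_index("Date")
# Convert "Jan 2 2014"-style strings into timestamps.
ts.index = pd.to_datetime(ts.index, format="%b %d %Y")

# One row per calendar day of 2014, no columns.
d2 = pd.DataFrame(index=pd.date_range("2014-01-01", "2014-12-31"))

# Right join: every day appears; days with several records keep all of them.
joined = ts.join(d2, how="right")
print(joined.head())
```

Days without records come back as rows of NaN, and a day with two records still occupies two rows, so this fills the calendar but does not yet sum Instances per shift and description.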
Actually, you were pretty close to what you wanted (assuming I correctly understood the output you are looking for). See my additions to your code:
import pandas as pd
ts = pd.read_csv('Absentee_Data_2.csv', encoding = 'utf-8',parse_dates=[3],index_col=3,dayfirst=True, sep=",")
idx = pd.date_range('01.01.2009', '12.31.2017')
ts.index = pd.DatetimeIndex(ts.index)
#ts = ts.reindex(idx, fill_value='NaN')
df = pd.DataFrame(index = idx)
df1 = df.join(ts, how='left')
df2 = df1.copy()
df3 = df1.copy()
df4 = df1.copy()
dict1 = {'Description': 'Discipline', 'Instances': 0, 'Shift': '1st Cooks'}
df1 = df1.fillna(dict1)
dict1["Description"] = "Vacation"
df2 = df2.fillna(dict1)
dict1["Shift"] = "2nd Baker"
df3 = df3.fillna(dict1)
dict1["Description"] = "Discipline"
df4 = df4.fillna(dict1)
df_with_duplicates = pd.concat([df1,df2,df3,df4])
final_res = df_with_duplicates.reset_index().drop_duplicates(subset=["index"] + list(dict1.keys())).set_index("index").drop("Unexcused", axis=1)
Basically what you'd add:
Copy the almost empty df created with ts (df1) four times.
fillna(dict1) fills all the NaN values in the listed columns with static values.
drop_duplicates removes the duplicated rows. We need the index to keep the added values, hence the reset_index followed by set_index("index").
Finally, some sample output:
In [5]: final_res["2013-01-2"]
Out[5]:
Description Instances Shift
index
2013-01-02 Discipline 0.0 1st Cooks
2013-01-02 Vacation 0.0 1st Cooks
2013-01-02 Vacation 0.0 2nd Baker
2013-01-02 Discipline 0.0 2nd Baker
In [6]: final_res["2014-01-2"]
Out[6]:
Description Instances Shift
index
2014-01-02 Discipline 1.0 2nd Baker
2014-01-02 Vacation 2.0 1st Cooks
2014-01-02 Discipline 3.0 2nd Baker
2014-01-02 Vacation 1.0 1st Cooks
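The output above still has one row per record rather than one summed row per shift and description. As an alternative (a different technique from the fillna/concat code above, not the answerer's method), the following sketch aggregates first and then reindexes against every date, shift and description combination. The question does not quote its error message, so it is only an assumption that the reindex call failed because dates repeat in the index; summing first sidesteps that either way. The four rows are typed inline from the question's sample, and the shift and description lists are hard-coded for illustration.

```python
import pandas as pd

# Four rows from the question's sample data, typed inline for illustration.
ts = pd.DataFrame({
    "Description": ["Discipline", "Vacation", "Discipline", "Vacation"],
    "Instances": [1, 2, 3, 1],
    "Date": pd.to_datetime(["2014-01-02"] * 4),
    "Shift": ["2nd Baker", "1st Cooks", "2nd Baker", "1st Cooks"],
})

# Dates repeat, so collapse to one total per (Date, Shift, Description).
totals = ts.groupby(["Date", "Shift", "Description"])["Instances"].sum()

# Every day x every shift x every description in the reporting period.
full_index = pd.MultiIndex.from_product(
    [
        pd.date_range("2014-01-01", "2014-12-31"),
        ["1st Cooks", "2nd Baker"],
        ["Discipline", "Vacation"],
    ],
    names=["Date", "Shift", "Description"],
)

# Combinations with no records get 0 instead of NaN.
result = totals.reindex(full_index, fill_value=0).reset_index()
print(result[result["Date"] == pd.Timestamp("2014-01-02")])
```

With real data, the shift and description lists could be taken from `ts["Shift"].unique()` and `ts["Description"].unique()` instead of being written out by hand.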