Fill in missing dates in dataframe using the mean
I have dates that I'm pulling into a dataframe at regular intervals. The data is generally well-formed, but sometimes there is bad data in what should be a date column.
I would always expect to have a date in the parsed 9-element form:
(tm_year=2000, tm_mon=11, tm_mday=30, tm_hour=0, tm_min=0, tm_sec=0, tm_wday=3, tm_yday=335, tm_isdst=-1)
(2015, 12, 29, 0, 30, 50, 1, 363, 0)
How should I check and fix this?
What I would like to do is replace whatever is not a date with a date based on a variable that represents last_update + 1/2 the update interval, so the items are not filtered out by later functions.
The data shown is published_parsed from feedparser.
import pandas as pd
import datetime
# date with ugly data
df_date_ugly = pd.DataFrame({'date': [
(2015, 12, 29, 0, 30, 50, 1, 363, 0),
(2015, 12, 28, 23, 59, 12, 0, 362, 0),
'None', '',
(2015, 12, 28, 23, 59, 12, 0, 362, 0)
]})
# date is fine
df_date = pd.DataFrame({'date': [
(2015, 12, 29, 0, 30, 50, 1, 363, 0),
(2015, 12, 28, 23, 59, 12, 0, 362, 0),
(2015, 12, 28, 23, 59, 12, 0, 362, 0)
]})
Pseudocode
if the original_date is valid
    return original_date
else
    return substitute_date
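The pseudocode above could be sketched roughly as follows. This is only a minimal sketch under stated assumptions: `last_update`, `update_interval` and `fix_date` are hypothetical names with made-up values, and "valid" is assumed to mean a 9-element tuple whose first six fields form a real date and time.

```python
import datetime

import pandas as pd

# Hypothetical values: when the feed was last polled and how often it is polled.
last_update = datetime.datetime(2015, 12, 28, 12, 0, 0)
update_interval = datetime.timedelta(hours=24)
substitute_date = last_update + update_interval / 2


def fix_date(value):
    """Return the parsed date, or substitute_date when value is not a date."""
    if isinstance(value, tuple) and len(value) == 9:
        try:
            # The first six fields are year, month, day, hour, minute, second.
            return datetime.datetime(*value[:6])
        except (TypeError, ValueError):
            pass
    return substitute_date


df_date_ugly = pd.DataFrame({'date': [
    (2015, 12, 29, 0, 30, 50, 1, 363, 0),
    (2015, 12, 28, 23, 59, 12, 0, 362, 0),
    'None', '',
    (2015, 12, 28, 23, 59, 12, 0, 362, 0),
]})
df_date_ugly['fixed'] = df_date_ugly['date'].apply(fix_date)
print(df_date_ugly['fixed'])
```

With these assumed values, the two bad rows both receive the same substitute date, so they survive any later filtering on the date column.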
import calendar
import numpy as np
import pandas as pd
def tuple_to_timestamp(x):
    try:
        return calendar.timegm(x) # 1
    except (TypeError, ValueError):
        return np.nan
df = pd.DataFrame({'orig': [
(2015, 12, 29, 0, 30, 50, 1, 363, 0),
(2015, 12, 28, 23, 59, 12, 0, 362, 0),
'None', '',
(2015, 12, 30, 23, 59, 12, 0, 362, 0)]})
ts = df['orig'].apply(tuple_to_timestamp) # 2
# 0 1451349050
# 1 1451347152
# 2 NaN
# 3 NaN
# 4 1451519952
# Name: orig, dtype: float64
ts = ts.interpolate() # 3
# 0 1451349050
# 1 1451347152
# 2 1451404752
# 3 1451462352
# 4 1451519952
# Name: orig, dtype: float64
df['fixed'] = pd.to_datetime(ts, unit='s') # 4
print(df)
yields
orig fixed
0 (2015, 12, 29, 0, 30, 50, 1, 363, 0) 2015-12-29 00:30:50
1 (2015, 12, 28, 23, 59, 12, 0, 362, 0) 2015-12-28 23:59:12
2 None 2015-12-29 15:59:12
3 2015-12-30 07:59:12
4 (2015, 12, 30, 23, 59, 12, 0, 362, 0) 2015-12-30 23:59:12
Explanation:
1. calendar.timegm converts each time-tuple to a timestamp. Unlike time.mktime, it interprets the time-tuple as being in UTC, not local time.
2. apply calls tuple_to_timestamp for each row of df['orig'].
3. The nice thing about timestamps is that they are numeric, so you can then use numerical methods such as Series.interpolate to fill in NaNs with interpolated values. Note that the two NaNs do not get filled with the same interpolated value; their values are linearly interpolated based on their position as given by ts.index (a default integer index here, so the rows are treated as equally spaced).
4. pd.to_datetime converts the timestamps to dates.
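If you want the mean, as the title asks, rather than linear interpolation, the same numeric-timestamp idea might be adapted as in the sketch below. It swaps Series.interpolate for Series.fillna with the mean of the valid timestamps; `ts_mean` and `fixed_mean` are names introduced here for illustration only.

```python
import calendar

import numpy as np
import pandas as pd


def tuple_to_timestamp(x):
    try:
        return calendar.timegm(x)
    except (TypeError, ValueError):
        return np.nan


orig = pd.Series([
    (2015, 12, 29, 0, 30, 50, 1, 363, 0),
    (2015, 12, 28, 23, 59, 12, 0, 362, 0),
    'None', '',
    (2015, 12, 30, 23, 59, 12, 0, 362, 0)])

ts = orig.apply(tuple_to_timestamp)
# Series.mean skips NaN by default, so only the valid timestamps are averaged.
ts_mean = ts.fillna(ts.mean())
fixed_mean = pd.to_datetime(ts_mean, unit='s')
print(fixed_mean)
```

Unlike interpolation, every bad row gets the same date here, which may or may not matter for later sorting.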
When working with dates and times in pandas, convert them to a pandas timestamp using pandas.to_datetime. To use this function, we will convert each tuple into a string with just the date and time elements. For your case, values that do not have length 9 are considered bad and are replaced with an empty string ''.
# convert each tuple into a string with date & time
# only elements of length 9 will be parsed
dates_df = df_date_ugly.applymap(
    lambda x: "{0}/{1}/{2} {3}:{4}:{5}".format(x[0], x[1], x[2], x[3], x[4], x[5])
    if len(x) == 9 else '')
# convert to a pandas timestamp; unparseable values become NaT
dates_df['date'] = pd.to_datetime(dates_df['date'], errors='coerce')

                 date
0 2015-12-29 00:30:50
1 2015-12-28 23:59:12
2                 NaT
3                 NaT
4 2015-12-28 23:59:12
Find the indices where the dates are missing using pd.isnull():
>>> missing = dates_df.index[pd.isnull(dates_df['date'])]
>>> missing
Int64Index([2, 3], dtype='int64')
To set the missing date as the midpoint between 2 dates:
start_date = dates_df['date'].iloc[0]
end_date = dates_df['date'].iloc[4]
missing_date = start_date + (end_date - start_date) / 2
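To actually write the midpoint back into the missing rows, something like the following sketch could be used. It rebuilds a small `dates_df` directly from strings so that it runs on its own; the `.loc` assignment at the end is an addition for illustration and is not part of the answer above.

```python
import pandas as pd

# Stand-in for the parsed frame: two rows failed to parse and are NaT.
dates_df = pd.DataFrame({'date': pd.to_datetime([
    '2015-12-29 00:30:50',
    '2015-12-28 23:59:12',
    None, None,
    '2015-12-28 23:59:12',
])})

# Index labels of the rows whose date is missing.
missing = dates_df.index[dates_df['date'].isnull()]

start_date = dates_df['date'].iloc[0]
end_date = dates_df['date'].iloc[4]
missing_date = start_date + (end_date - start_date) / 2

# Fill every missing row with the midpoint.
dates_df.loc[missing, 'date'] = missing_date
print(dates_df)
```

Note that the midpoint is taken between the first and last rows only, so all missing rows receive the same value.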