Fill in missing dates in a dataframe using the mean

Question

I have dates that I'm pulling into a dataframe at regular intervals. The data is generally well-formed, but sometimes there is bad data in what should be a date column.

I would always expect to have a date in the parsed nine-element form:

(tm_year=2000, tm_mon=11, tm_mday=30, tm_hour=0, tm_min=0, tm_sec=0, tm_wday=3, tm_yday=335, tm_isdst=-1)
(2015, 12, 29, 0, 30, 50, 1, 363, 0)

How should I check and fix this?

What I would like to do is replace whatever is not a date with a substitute date, computed as last_update plus half the update interval, so that the items are not filtered out by later functions.

The data shown is published_parsed from feedparser.

import pandas as pd
import datetime

# date with ugly data
df_date_ugly = pd.DataFrame({'date': [
                             (2015, 12, 29, 0, 30, 50, 1, 363, 0), 
                             (2015, 12, 28, 23, 59, 12, 0, 362, 0),
                            'None', '',
                             (2015, 12, 28, 23, 59, 12, 0, 362, 0)
                            ]})

# date is fine
df_date =  pd.DataFrame({'date': [
                             (2015, 12, 29, 0, 30, 50, 1, 363, 0), 
                             (2015, 12, 28, 23, 59, 12, 0, 362, 0),
                             (2015, 12, 28, 23, 59, 12, 0, 362, 0)
                            ]})

Pseudocode
  if the original_date is valid
     return original_date
  else
     return substitute_date
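
The pseudocode above could be sketched in plain Python roughly as follows. The names `is_valid_time_tuple` and `fix_date`, and the example values for `last_update` and `update_interval`, are hypothetical and only serve the illustration; the validity test (nine fields that `calendar.timegm` accepts) is one possible assumption about what "valid" means.

```python
import calendar
from datetime import datetime, timedelta


def is_valid_time_tuple(value):
    """Return True if value is a 9-field time tuple that calendar.timegm accepts."""
    if isinstance(value, str):
        # Strings such as 'None' or '' are never valid dates here.
        return False
    try:
        if len(value) != 9:
            return False
        calendar.timegm(value)
    except (TypeError, ValueError):
        return False
    return True


def fix_date(original_date, last_update, update_interval):
    """Keep a valid date; otherwise substitute last_update + 1/2 update_interval."""
    if is_valid_time_tuple(original_date):
        return original_date
    substitute = last_update + update_interval / 2
    # Return the substitute in the same 9-field tuple form as the valid rows.
    return tuple(substitute.utctimetuple())


# Hypothetical values, chosen only for the example.
last_update = datetime(2015, 12, 28, 12, 0, 0)
update_interval = timedelta(hours=24)

good = (2015, 12, 29, 0, 30, 50, 1, 363, 0)
print(fix_date(good, last_update, update_interval))    # unchanged
print(fix_date('None', last_update, update_interval))  # substituted
```

Applied to a dataframe column, a function like this could be passed to `Series.apply` together with the two extra arguments.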

Answer 1

import calendar
import numpy as np
import pandas as pd

def tuple_to_timestamp(x):
    try:
        return calendar.timegm(x)               # 1
    except (TypeError, ValueError):
        return np.nan

df = pd.DataFrame({'orig': [
    (2015, 12, 29, 0, 30, 50, 1, 363, 0), 
    (2015, 12, 28, 23, 59, 12, 0, 362, 0),
    'None', '',
    (2015, 12, 30, 23, 59, 12, 0, 362, 0)]})

ts = df['orig'].apply(tuple_to_timestamp)       # 2
# 0    1451349050
# 1    1451347152
# 2           NaN
# 3           NaN
# 4    1451519952
# Name: orig, dtype: float64

ts = ts.interpolate()                           # 3
# 0    1451349050
# 1    1451347152
# 2    1451404752
# 3    1451462352
# 4    1451519952
# Name: orig, dtype: float64

df['fixed'] = pd.to_datetime(ts, unit='s')      # 4

print(df)

yields

                                    orig               fixed
0   (2015, 12, 29, 0, 30, 50, 1, 363, 0) 2015-12-29 00:30:50
1  (2015, 12, 28, 23, 59, 12, 0, 362, 0) 2015-12-28 23:59:12
2                                   None 2015-12-29 15:59:12
3                                        2015-12-30 07:59:12
4  (2015, 12, 30, 23, 59, 12, 0, 362, 0) 2015-12-30 23:59:12

Explanation:

1. calendar.timegm converts each time-tuple to a timestamp. Unlike time.mktime, it interprets the time-tuple as being in UTC, not local time.
2. apply calls tuple_to_timestamp for each row of df['orig'].
3. The nice thing about timestamps is that they are numeric, so you can then use numerical methods such as Series.interpolate to fill in NaNs with interpolated values. Note that the two NaNs do not get filled with the same interpolated value; their values are linearly interpolated based on their position as given by ts.index.
4. pd.to_datetime converts the timestamps to dates.
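
The UTC behaviour mentioned in step 1 can be checked with a short standard-library sketch. It only uses the first time-tuple from the example data. `time.gmtime` is the inverse of `calendar.timegm`, so a round trip should give back the same nine fields regardless of the machine's local time zone.

```python
import calendar
import time

time_tuple = (2015, 12, 29, 0, 30, 50, 1, 363, 0)

# Interpret the tuple as UTC and convert it to seconds since the epoch.
ts = calendar.timegm(time_tuple)
print(ts)  # 1451349050, the value shown for row 0 above

# time.gmtime converts the timestamp back to a UTC time-tuple.
round_trip = tuple(time.gmtime(ts))
print(round_trip == time_tuple)  # True
```

time.mktime, by contrast, would give a result that depends on the local time zone setting.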

Answer 2

When working with dates and times in pandas, convert them to pandas timestamps using pandas.to_datetime. To use this function, we will convert each tuple into a string containing just the date and time elements. For your case, values that do not have length 9 are considered bad and are replaced with an empty string ''.
```
# Convert each 9-element time tuple into a "Y/M/D h:m:s" string;
# anything else (e.g. 'None' or '') becomes an empty string.
# Note: applymap is named DataFrame.map in pandas >= 2.1.
dates_df = df_date_ugly.applymap(
    lambda x: "{0}/{1}/{2} {3}:{4}:{5}".format(*x[:6]) if len(x) == 9 else '')

# Convert to pandas timestamps; strings that cannot be parsed become NaT.
dates_df['date'] = pd.to_datetime(dates_df['date'], errors='coerce')

#                  date
# 0 2015-12-29 00:30:50
# 1 2015-12-28 23:59:12
# 2                 NaT
# 3                 NaT
# 4 2015-12-28 23:59:12
```
Find the indices where the dates are missing using pd.isnull():
```
>>> missing = dates_df.index[pd.isnull(dates_df['date'])]
>>> missing
Int64Index([2, 3], dtype='int64')
```

To set the missing date to the midpoint between two dates:

start_date = dates_df['date'].iloc[0]
end_date = dates_df['date'].iloc[4]
missing_date = start_date + (end_date - start_date) / 2
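
As a minimal, self-contained sketch of how the midpoint might then be written into the missing rows: the series below is rebuilt by hand from the parsed output shown earlier, and using `Series.fillna` for the assignment is an assumption on my part, not something the answer spells out.

```python
import pandas as pd

# Parsed dates as in the output above; None stands in for the bad rows (NaT).
dates = pd.Series(pd.to_datetime([
    '2015-12-29 00:30:50',
    '2015-12-28 23:59:12',
    None,
    None,
    '2015-12-28 23:59:12',
]))

# Midpoint between the first and the last date in the column.
start_date = dates.iloc[0]
end_date = dates.iloc[-1]
missing_date = start_date + (end_date - start_date) / 2

# Replace every NaT with the midpoint; valid dates are left untouched.
filled = dates.fillna(missing_date)
print(filled)
```

Both missing rows receive the same value here, unlike with interpolation, where each missing row gets its own value.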

Answer 1
3 2016-01-01 11:49:25

Answer 2
2 Accepted 2016-01-01 00:07:05
