Fill in missing dates in dataframe using the mean

I have dates that I'm pulling into a dataframe at regular intervals. The data is generally well-formed, but sometimes a column that should contain only dates has bad values in it.

I would always expect a date in the parsed nine-element form:

(tm_year=2000, tm_mon=11, tm_mday=30, tm_hour=0, tm_min=0, tm_sec=0, tm_wday=3, tm_yday=335, tm_isdst=-1)
(2015, 12, 29, 0, 30, 50, 1, 363, 0)

How should I check and fix this?

What I would like to do is replace whatever is not a date with a date based on a variable that represents last_update + 1/2 the update interval, so that the items are not filtered out by later functions.

The data shown is published_parsed from feedparser.

import pandas as pd
import datetime

# date with ugly data
df_date_ugly = pd.DataFrame({'date': [
                             (2015, 12, 29, 0, 30, 50, 1, 363, 0), 
                             (2015, 12, 28, 23, 59, 12, 0, 362, 0),
                            'None', '',
                             (2015, 12, 28, 23, 59, 12, 0, 362, 0)
                            ]})

# date is fine
df_date =  pd.DataFrame({'date': [
                             (2015, 12, 29, 0, 30, 50, 1, 363, 0), 
                             (2015, 12, 28, 23, 59, 12, 0, 362, 0),
                             (2015, 12, 28, 23, 59, 12, 0, 362, 0)
                            ]})

Pseudocode
  if the original_date is valid
     return original_date
  else
     return substitute_date
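
For a single value, the pseudocode above could be sketched roughly as follows. This is only a sketch under assumptions: `last_update`, `update_interval`, and the helper name `valid_or_substitute` are hypothetical (they are not part of feedparser), and a value is treated as valid only if `calendar.timegm` accepts it.

```python
import calendar
from datetime import datetime, timedelta, timezone

# Hypothetical values: when the feed was last polled and how often it is polled.
last_update = datetime(2015, 12, 28, 23, 0, 0, tzinfo=timezone.utc)
update_interval = timedelta(hours=1)
substitute_date = last_update + update_interval / 2


def valid_or_substitute(original_date, substitute=substitute_date):
    """Return original_date as a UTC datetime, or substitute if it is not a time tuple."""
    try:
        seconds = calendar.timegm(original_date)
    except (TypeError, ValueError):
        # Strings such as 'None' or '' (and None itself) end up here.
        return substitute
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


print(valid_or_substitute((2015, 12, 29, 0, 30, 50, 1, 363, 0)))  # 2015-12-29 00:30:50+00:00
print(valid_or_substitute('None'))                                 # 2015-12-28 23:30:00+00:00
print(valid_or_substitute(''))                                     # 2015-12-28 23:30:00+00:00
```

Note that this returns a datetime rather than the original tuple, which may or may not suit the later filtering functions.
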
import calendar
import numpy as np
import pandas as pd

def tuple_to_timestamp(x):
    try:
        return calendar.timegm(x)               # 1
    except (TypeError, ValueError):
        return np.nan

df = pd.DataFrame({'orig': [
    (2015, 12, 29, 0, 30, 50, 1, 363, 0), 
    (2015, 12, 28, 23, 59, 12, 0, 362, 0),
    'None', '',
    (2015, 12, 30, 23, 59, 12, 0, 362, 0)]})

ts = df['orig'].apply(tuple_to_timestamp)       # 2
# 0    1451349050
# 1    1451347152
# 2           NaN
# 3           NaN
# 4    1451519952
# Name: orig, dtype: float64

ts = ts.interpolate()                           # 3
# 0    1451349050
# 1    1451347152
# 2    1451404752
# 3    1451462352
# 4    1451519952
# Name: orig, dtype: float64

df['fixed'] = pd.to_datetime(ts, unit='s')      # 4

print(df)

yields

                                    orig               fixed
0   (2015, 12, 29, 0, 30, 50, 1, 363, 0) 2015-12-29 00:30:50
1  (2015, 12, 28, 23, 59, 12, 0, 362, 0) 2015-12-28 23:59:12
2                                   None 2015-12-29 15:59:12
3                                        2015-12-30 07:59:12
4  (2015, 12, 30, 23, 59, 12, 0, 362, 0) 2015-12-30 23:59:12

Explanation :

  1. calendar.timegm converts each time-tuple to a timestamp. Unlike time.mktime , it interprets the time-tuple as being in UTC, not local time.

  2. apply calls tuple_to_timestamp for each row of df['orig'] .

  3. The nice thing about timestamps is that they are numeric, so you can then use numerical methods such as Series.interpolate to fill in NaNs with interpolated values. Note that the two NaNs do not get filled with the same interpolated value; their values are linearly interpolated based on their position as given by ts.index .

  4. pd.to_datetime converts the timestamps to dates.
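
To make points 1 and 3 concrete, here is a small sketch that reuses two of the values from the example above. It only illustrates the behaviour described in the explanation; the variable names are chosen for this illustration.

```python
import calendar

import numpy as np
import pandas as pd

# Point 1: calendar.timegm reads the tuple as UTC, so the result does not
# depend on the local timezone of the machine running the code.
seconds = calendar.timegm((2015, 12, 28, 23, 59, 12, 0, 362, 0))
print(seconds)  # 1451347152

# Point 3: with the default linear method, consecutive NaNs are filled with
# evenly spaced values between the valid neighbours, not with one shared value.
ts = pd.Series([1451347152, np.nan, np.nan, 1451519952], dtype='float64')
filled = ts.interpolate()
print(filled.tolist())

# The gap between consecutive filled values is constant.
steps = filled.diff().dropna()
print(steps.tolist())
```
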

  1. When working with dates and times in pandas, convert them to a pandas timestamp using pandas.to_datetime . To use this function, we will convert each tuple into a string with just the date and time elements. For your case, values that do not have length 9 will be considered bad and are replaced with an empty string '' .

     # convert each tuple into a string with date & time
     # only elements of length 9 will be parsed
     dates_df = df_date_ugly.applymap(lambda x: "{0}/{1}/{2} {3}:{4}:{5}".format(x[0], x[1], x[2], x[3], x[4], x[5]) if len(x) == 9 else '')

     # convert to a pandas timestamp (assign back to the column so dates_df stays a DataFrame)
     dates_df['date'] = pd.to_datetime(dates_df['date'], errors='coerce')

                      date
     0 2015-12-29 00:30:50
     1 2015-12-28 23:59:12
     2                 NaT
     3                 NaT
     4 2015-12-28 23:59:12

  2. To find the indices where the dates are missing, use pd.isnull() to build a boolean mask and select from the index with it:

     >>> missing = dates_df.index[pd.isnull(dates_df['date'])]
     >>> missing
     Int64Index([2, 3], dtype='int64')

  3. To set the missing date as the midpoint between 2 dates:

     start_date = dates_df.iloc[0,:]
     end_date = dates_df.iloc[4,:]
     missing_date = start_date + (end_date - start_date)/2
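
Once a midpoint is available, the missing entries still have to be written back. One way to do that, sketched here on a standalone Series rather than on `dates_df`, is `Series.fillna`. The use of `fillna` and the names `dates`, `midpoint`, and `filled` are assumptions for this illustration, not part of the steps above.

```python
import pandas as pd

# Same dates as in the example; None becomes NaT.
dates = pd.Series(pd.to_datetime([
    '2015-12-29 00:30:50',
    '2015-12-28 23:59:12',
    None,
    None,
    '2015-12-28 23:59:12',
]))

# Midpoint between the first and last valid dates.
start_date = dates.iloc[0]
end_date = dates.iloc[4]
midpoint = start_date + (end_date - start_date) / 2

# Replace every NaT with the midpoint.
filled = dates.fillna(midpoint)
print(filled)
```

Unlike interpolation, this gives every missing row the same date, which is closer to the "one substitute date" idea in the question.
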
