简体   繁体   中英

How to fill missing date in timeSeries

Here's what my data looks like:

我的数据看起来像这样

There are daily records, except for a gap from 2017-06-12 to 2017-06-16.

df2['timestamp'] = pd.to_datetime(df['timestamp'])
df2['timestamp'] = df2['timestamp'].map(lambda x: 
datetime.datetime.strftime(x,'%Y-%m-%d'))
df2 = df2.convert_objects(convert_numeric = True)
df2 = df2.groupby('timestamp', as_index = False).sum()

I need to fill this missing gap and others with values for all fields (eg timestamp , temperature , humidity , light , pressure , speed , battery_voltage , etc...).

How can I accomplish this with Pandas?

This is what I have done before

weektime = pd.date_range(start = '06/04/2017', end = '12/05/2017', freq = 'W-SUN')
df['week'] = 'nan'
df['weektemp'] = 'nan'
df['weekhumidity'] = 'nan'
df['weeklight'] = 'nan'
df['weekpressure'] = 'nan'
df['weekspeed'] = 'nan'
df['weekbattery_voltage'] = 'nan'

for i in range(0,len(weektime)):
    df['week'][i+1] = weektime[i]
    df['weektemp'][i+1] = df['temperature'].iloc[7*i+1:7*i+7].sum()
    df['weekhumidity'][i+1] = df['humidity'].iloc[7*i+1:7*i+7].sum()
    df['weeklight'][i+1] = df['light'].iloc[7*i+1:7*i+7].sum()
    df['weekpressure'][i+1] = df['pressure'].iloc[7*i+1:7*i+7].sum()
    df['weekspeed'][i+1] = df['speed'].iloc[7*i+1:7*i+7].sum()
    df['weekbattery_voltage'][i+1] = 
df['battery_voltage'].iloc[7*i+1:7*i+7].sum()
     i = i + 1

The value of sum is not correct. Cause the value of 2017-06-17 is a sum of 2017-06-12 to 2017-06-16. I do not want to add them again. This gap is not only one gap in the period. I want to fill all of them.

Here is a function I wrote that might be helpful to you. It looks for inconsistent jumps in time and fills them in. After using this function, try using a linear interpolation function (pandas has a good one) to fill in your null data values. Note: Numpy arrays are much faster to iterate over and manipulate than Pandas dataframes, which is why I switch between the two.

import numpy as np
import pandas as pd

data_arr = np.array(your_df)
periodicity = 'daily'

def fill_gaps(data_arr, periodicity):
    rows = data_arr.shape[0]
    data_no_gaps = np.copy(data_arr) #avoid altering the thing you're iterating over
    data_no_gaps_idx = 0

    for row_idx in np.arange(1, rows): #iterate once for each row (except the first record; nothing to compare)
        oldtimestamp_str = str(data_arr[row_idx-1, 0]) 
        oldtimestamp = np.datetime64(oldtimestamp_str)  

        currenttimestamp_str = str(data_arr[row_idx, 0])
        currenttimestamp = np.datetime64(currenttimestamp_str)

        period = currenttimestamp - oldtimestamp

        if period != np.timedelta64(900,'s') and period != np.timedelta64(3600,'s') and period != np.timedelta64(86400,'s'):                                
            if periodicity == 'quarterly':
                desired_period = 900
            elif periodicity == 'hourly':
                desired_period = 3600
            elif periodicity == 'daily':
                desired_period = 86400

            periods_missing = int(period / np.timedelta64(desired_period,'s'))
            for missing in np.arange(1, periods_missing):
                new_time_orig = str(oldtimestamp + missing*(np.timedelta64(desired_period,'s')))
                new_time = new_time_orig.replace('T', ' ')
                data_no_gaps = np.insert(data_no_gaps, (data_no_gaps_idx + missing), 
                                 np.array((new_time, np.nan, np.nan, np.nan, np.nan, np.nan)), 0) # INSERT VALUES YOU WANT IN THE NEW ROW

            data_no_gaps_idx += (periods_missing-1) #incriment the index (zero-based => -1) in accordance with added rows

        data_no_gaps_idx += 1 #allow index to change as we iterate over original data array (main for loop)

    #create a dataframe:
    data_arr_no_gaps = pd.DataFrame(data=data_no_gaps, index=None,columns=['Time', 'temp', 'humidity', 'light', 'pressure', 'speed'])

    return data_arr_no_gaps  

Fill time gaps and nulls

Use the function below to ensure expected date sequence exists, and then use forward fill to fill in nulls.

import pandas as pd
import os

def fill_gaps_and_nulls(df, freq='1D'):
    '''
    General steps:
        A) check for extra dates (out of expected frequency/sequence)
        B) check for missing dates (based on expected frequency/sequence)
        C) use forwardfill to fill nulls
        D) use backwardfill to fill remaining nulls
        E) append to file    
    '''
    
    #rename the timestamp to 'date'
    df.rename(columns={"timestamp": "date"})
    
    #sort to make indexing faster
    df = df.sort_values(by=['date'], inplace=False)
    
    #create an artificial index of dates at frequency = freq, with the same beginning and ending as the original data
    all_dates = pd.date_range(start=df.date.min(), end=df.date.max(), freq=freq)

    #record column names
    df_cols = df.columns

    #delete ffill_df.csv so we can begin anew
    try:
        os.remove('ffill_df.csv')
    except FileNotFoundError:
        pass

    #check for extra dates and/or dates out of order. print warning statement for log
    extra_dates = set(df.date).difference(all_dates)
    
    #if there are extra dates (outside of expected sequence/frequency), deal with them
    if len(extra_dates) > 0:
        #############################
        #INSERT DESIRED BEHAVIOR HERE        
        print('WARNING: Extra date(s):\n\t{}\n\t    Shifting highlighted date(s) back by 1 day'.format(extra_dates))
        for date in extra_dates:
            #shift extra dates back one day
            df.date[df.date == date] = date - pd.Timedelta(days=1)
        #############################

    #check the artificial date index against df to identify missing gaps in time and fill them with nulls
    gaps = all_dates.difference(set(df.date))
    print('\n-------\nWARNING: Missing dates: {}\n-------\n'.format(gaps))

    #if there are time gaps, deal with them
    if len(gaps) > 0:
        #initialize df of correct size, filled with nulls
        gaps_df = pd.DataFrame(index=gaps, columns=df_cols.drop('date')) #len(index) sets number of rows
        #give index a name
        gaps_df.index.name = 'date'
        #add the region and type
        gaps_df.region = r
        gaps_df.type = t
        #remove that index so gaps_df and df are compatible
        gaps_df.reset_index(inplace=True)                
        #append gaps_df to df
        new_df = pd.concat([df, gaps_df])
        #sort on date
        new_df.sort_values(by='date', inplace=True)

    #fill nulls
    new_df.fillna(method='ffill', inplace=True)
    new_df.fillna(method='bfill', inplace=True)

    #append to file
    new_df.to_csv('ffill_df.csv', mode='a', header=False, index=False) 
    
    return df_cols, regions, types, all_dates

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM