
Data cleaning and preparation for Time-Series-LSTM

I need to prepare my data to feed it into an LSTM that predicts the next day. My dataset is a time series at one-second resolution, but I only have 3-5 hours of data per day. (This is the only dataset I have, so I can't change it.) It has a date-time column and a value, e.g.:

datetime               Value
2015-03-15 12:00:00    1000
2015-03-15 12:00:01    10
...

I would like to write code that extracts, e.g., 4 hours per day and, only for specific months, deletes the first extracted hour (because that data is faulty). So far I have managed to write code that extracts, e.g., 2 hours per day for the x data (input) and the y data (output).

The dataset covers 1 year at one-second resolution, from 6pm to 11pm each day; the rest is missing. In, e.g., August-November the first hour is faulty and needs to be deleted.

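import numpy as np
import pandas as pd

# x_df is the input DataFrame with a DatetimeIndex
init = True
for day in np.unique(x_df.index.date):
    # slice 18:00-20:00 of the current day
    temp = x_df.loc[(day + pd.DateOffset(hours=18)):(day + pd.DateOffset(hours=20))]

    # keep only complete days (2 hours of seconds, both endpoints included)
    if len(temp) == 7201:
        if init:
            x_df1 = np.array([temp.values])
            init = False
        else:
            #print (temp.values.shape)
            x_df1 = np.append(x_df1, np.array([temp.values]), axis=0)
    #else:
        #if not temp.empty:
            #print (temp.index[0].date(), len(temp))

x_df1 = np.array(x_df1)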

# samples, timesteps and features for the LSTM
print('X-Shape:', x_df1.shape,
      'Y-Shape:', y_df1.shape)

Output:

X-Shape: (32, 7201, 6) Y-Shape: (32, 7201)

My expected result is a dataset of, e.g., 4 hours per day in which the first hour is deleted for, e.g., August, September, and October. I would also be very happy if someone could show me nicer code for this.

This is probably not the most efficient solution, but it may still fit.

First, let's generate some random data: 5 days in each of the first 4 months, with 4 hours of per-second values per day:

import datetime
import random
import pandas as pd

frames = []
for month in range(1, 5):  # first 4 months
    for day in range(5, 10):  # 5 days per month
        # random start time between 18:01 and 19:59
        hour = random.randint(18, 19)
        minute = random.randint(1, 59)
        dt = datetime.datetime(2018, month, day, hour, minute, 0)
        # 4 hours of per-second timestamps
        dti = pd.date_range(dt, periods=60*60*4, freq='s')
        values = [random.randrange(1, 101, 1) for _ in range(len(dti))]
        frames.append(pd.DataFrame(values, index=dti, columns=['Value']))

# DataFrame.append was removed in pandas 2.0, so concatenate once at the end
df = pd.concat(frames)

Now let's define a function that returns the first row of each day:

def first_value_per_day(df):
    res_df = df.groupby(df.index.date).apply(lambda x: x.iloc[[0]])
    res_df.index = res_df.index.droplevel(0)
    return res_df

and print the results:

print(first_value_per_day(df))

                     Value
2018-01-05 18:31:00     85
2018-01-06 18:25:00     40
2018-01-07 19:54:00     52
2018-01-08 18:23:00     46
2018-01-09 18:08:00     51
2018-02-05 18:58:00      6
2018-02-06 19:12:00     16
2018-02-07 18:18:00     10
2018-02-08 18:32:00     50
2018-02-09 18:38:00     69
2018-03-05 19:54:00    100
2018-03-06 18:37:00     70
2018-03-07 18:58:00     26
2018-03-08 18:28:00     30
2018-03-09 18:34:00     71
2018-04-05 18:54:00      2
2018-04-06 19:16:00    100
2018-04-07 18:52:00     85
2018-04-08 19:08:00     66
2018-04-09 18:11:00     22
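
As an aside, the same first-row-per-day selection can probably be written more compactly with `groupby(...).head(1)`, which keeps the original timestamps as the index so no `droplevel` is needed. This is only a sketch: the name `first_value_per_day_compact` and the small `demo` frame are made up for illustration, and it assumes the index is a `DatetimeIndex` sorted in time order.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: two days of per-second data with different start times.
idx = pd.date_range("2018-01-05 18:31:00", periods=5, freq="s").append(
    pd.date_range("2018-01-06 18:25:00", periods=5, freq="s"))
demo = pd.DataFrame({"Value": np.arange(10)}, index=idx)

def first_value_per_day_compact(frame):
    # head(1) keeps the first row of every calendar-day group
    # and leaves the original DatetimeIndex untouched.
    return frame.groupby(frame.index.date).head(1)

first_rows = first_value_per_day_compact(demo)
print(first_rows)
```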

Next we need a list of the months to process, in this case 2 and 3. Using the function defined above, we take the first entry of each day in every selected month, loop over those entries, collect the index of every row that falls between that first entry and one hour later, and drop those rows:

MONTHS_TO_MODIFY = [2,3]
HOURS_TO_DROP = 1

fvpd = first_value_per_day(df)
for m in MONTHS_TO_MODIFY:
    fvpdm = fvpd[fvpd.index.month == m]
    for idx, value in fvpdm.iterrows():
        start_dt = idx
        end_dt = idx + datetime.timedelta(hours=HOURS_TO_DROP)
        index_list = df[(df.index >= start_dt) & (df.index < end_dt)].index.tolist()
        df.drop(index_list, inplace=True)

The result:

print(first_value_per_day(df))

                     Value
2018-01-05 18:31:00     85
2018-01-06 18:25:00     40
2018-01-07 19:54:00     52
2018-01-08 18:23:00     46
2018-01-09 18:08:00     51
2018-02-05 19:58:00      1
2018-02-06 20:12:00     42
2018-02-07 19:18:00     34
2018-02-08 19:32:00     34
2018-02-09 19:38:00     61
2018-03-05 20:54:00     15
2018-03-06 19:37:00     88
2018-03-07 19:58:00     36
2018-03-08 19:28:00     38
2018-03-09 19:34:00     42
2018-04-05 18:54:00      2
2018-04-06 19:16:00    100
2018-04-07 18:52:00     85
2018-04-08 19:08:00     66
2018-04-09 18:11:00     22
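
If the loop over days turns out to be slow on a full year of per-second data, the same filtering can be sketched without explicit iteration by building one boolean mask. This is a minimal sketch, not a tested replacement: `drop_first_hours` and the `demo` frame are hypothetical names, and it assumes a sorted `DatetimeIndex` and that a day's recording never runs past midnight.

```python
import numpy as np
import pandas as pd

def drop_first_hours(frame, months, hours=1):
    """Drop the first `hours` of each day's data, only for days in `months`."""
    ts = frame.index.to_series()
    # For every row, the timestamp of the first sample recorded on that day.
    day_start = ts.groupby(frame.index.date).transform("min")
    # True for rows less than `hours` after that day's first sample.
    in_first_window = ((ts - day_start) < pd.Timedelta(hours=hours)).to_numpy()
    # True for rows whose month is one of the months to clean.
    in_month = frame.index.month.isin(months)
    return frame[~(in_first_window & in_month)]

# Hypothetical sample: one day in January (kept as-is), one in February.
idx = pd.date_range("2018-01-05 18:00:00", periods=3, freq="30min").append(
    pd.date_range("2018-02-05 18:00:00", periods=4, freq="30min"))
demo = pd.DataFrame({"Value": np.arange(7)}, index=idx)

cleaned = drop_first_hours(demo, months=[2], hours=1)
print(cleaned)
```

Unlike the loop above, this returns a new DataFrame instead of modifying `df` in place, so the original data stays available for comparison.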
