
Training autoencoder for variant length time series - Tensorflow

I am trying to train an LSTM model to reconstruct time series data. I have a data set of ~1800 univariate time series. Basically, I'm trying to solve a problem similar to the one in "Anomaly detection in ECG plots", but my time series have different lengths.

I used this approach to deal with the varying lengths: "How to apply LSTM-autoencoder to variant-length time-series data?", and this approach to split the input data based on shape: "Keras misinterprets training data shape".

When looping over the data and fitting the model for every shape, is the model eventually based only on the last shape it trained on, or does it use all the data to train the final model?

How would I train the model on all input data regardless of the shape of the data? I know I can add padding, but I am trying to use the data as is at this point. Any suggestions or other approaches to deal with time series of different lengths? (It is not an issue of time sampling; it is rather that one time series started recording on day X and some only on day X+100.)

Here is the code I am using for my autoencoder:

import numpy as np
import keras.backend as K
from keras.models import Model
from keras.layers import (Input, Dense, TimeDistributed, LSTM, GRU, Dropout, concatenate,
                          Flatten, RepeatVector, Bidirectional, SimpleRNN, Lambda)


def encoder(model_input, layer, size, num_layers, drop_frac=0.0, output_size=None,
            bidirectional=False):
    """Encoder module of autoencoder architecture"""
    if output_size is None:
        output_size = size
    encode = model_input
    for i in range(num_layers):
        wrapper = Bidirectional if bidirectional else lambda x: x
        # Only the last recurrent layer collapses the sequence into a single vector.
        encode = wrapper(layer(size, name='encode_{}'.format(i),
                               return_sequences=(i < num_layers - 1)))(encode)
        if drop_frac > 0.0:
            encode = Dropout(drop_frac, name='drop_encode_{}'.format(i))(encode)
    encode = Dense(output_size, activation='linear', name='encoding')(encode)
    return encode


def repeat(x):
    """Repeat the latent vector x[1] once per timestep of the input x[0]."""
    step_matrix = K.ones_like(x[0][:, :, :1])       # ones, shaped (batch, steps, 1)
    latent_matrix = K.expand_dims(x[1], axis=1)     # latent vars, shaped (batch, 1, latent_dim)
    return K.batch_dot(step_matrix, latent_matrix)  # shaped (batch, steps, latent_dim)


def decoder(model_input, encode, layer, size, num_layers, drop_frac=0.0, aux_input=None,
            bidirectional=False):
    """Decoder module of autoencoder architecture"""
    # The model input is passed in explicitly so the repeat count follows its length.
    decode = Lambda(repeat)([model_input, encode])
    if aux_input is not None:
        decode = concatenate([aux_input, decode])

    for i in range(num_layers):
        if drop_frac > 0.0 and i > 0:  # skip these for first layer for symmetry
            decode = Dropout(drop_frac, name='drop_decode_{}'.format(i))(decode)
        wrapper = Bidirectional if bidirectional else lambda x: x
        decode = wrapper(layer(size, name='decode_{}'.format(i),
                               return_sequences=True))(decode)

    decode = TimeDistributed(Dense(1, activation='linear'), name='time_dist')(decode)
    return decode


# shape=(None, 1): any number of timesteps, one feature per step
inputs = Input(shape=(None, 1))
encoded = encoder(inputs, LSTM, 128, 2, drop_frac=0.0, output_size=None, bidirectional=False)
decoded = decoder(inputs, encoded, LSTM, 128, 2, drop_frac=0.0, aux_input=None,
                  bidirectional=False)


sequence_autoencoder = Model(inputs, decoded)
sequence_autoencoder.compile(optimizer='adam', loss='mae')


# Group the series by shape so that each group can be stacked into one array.
trainByShape = {}
for item in train_data:
    if item.shape in trainByShape:
        trainByShape[item.shape].append(item)
    else:
        trainByShape[item.shape] = [item]

# Fit the same model on each shape group in turn (input and target are identical).
for shape in trainByShape:
    modelHistory = sequence_autoencoder.fit(
        np.asarray(trainByShape[shape]),
        np.asarray(trainByShape[shape]),
        epochs=100, batch_size=1, validation_split=0.15)

Use a bidirectional LSTM and increase the number of parameters to gain accuracy. I increased latent_dim to 1000 and it fit the data closely. This needs more hardware and more memory.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from keras.models import Sequential
from keras.layers import LSTM, Dense, Bidirectional


def create_dataset(dataset, look_back=3):
    """Build (window of look_back values, next value) pairs from a series."""
    dataX, dataY = [], []
    for i in range(len(dataset) - look_back - 1):
        dataX.append(dataset[i:(i + look_back)])
        dataY.append(dataset[i + look_back])
    return np.array(dataX), np.array(dataY)


# eqix_df is a DataFrame with an 'Open' column; it is not defined in this snippet.
COLUMNS = ['Open']
dataset = eqix_df[COLUMNS]
scaler = MinMaxScaler(feature_range=(0, 1))
dataset = scaler.fit_transform(np.array(dataset).reshape(-1, 1))

# 70/30 chronological split
train_size = int(len(dataset) * 0.70)
test_size = len(dataset) - train_size
train, test = dataset[0:train_size], dataset[train_size:len(dataset)]

look_back = 10
trainX, y_train = create_dataset(train, look_back)
testX, y_test = create_dataset(test, look_back)

# Reshape to (samples, timesteps=1, features=look_back)
X_train = np.reshape(trainX, (trainX.shape[0], 1, trainX.shape[1]))
X_test = np.reshape(testX, (testX.shape[0], 1, testX.shape[1]))
latent_dim = 700
n_future = 1

model = Sequential()

# input_shape is (timesteps, features), matching the reshape above
model.add(Bidirectional(LSTM(units=latent_dim, return_sequences=True,
                             input_shape=(X_train.shape[1], X_train.shape[2]))))

# LSTM 1
model.add(Bidirectional(LSTM(latent_dim, return_sequences=True, dropout=0.4, recurrent_dropout=0.4, name='lstm1')))

# LSTM 2
model.add(Bidirectional(LSTM(latent_dim, return_sequences=True, dropout=0.2, recurrent_dropout=0.4, name='lstm2')))

# LSTM 3
model.add(Bidirectional(LSTM(latent_dim, return_sequences=False, dropout=0.2, recurrent_dropout=0.4, name='lstm3')))

model.add(Dense(units=n_future))

# This is a regression, so track mean absolute error rather than accuracy.
model.compile(optimizer="adam", loss="mean_squared_error", metrics=["mae"])

history = model.fit(X_train, y_train, epochs=50, verbose=0)

plt.plot(history.history['loss'])
plt.title('training loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train'], loc='upper left')
plt.show()

prediction = model.predict(X_test)

# shift test predictions by look_back for plotting
testPredictPlot = np.empty_like(dataset)
testPredictPlot[:, :] = np.nan
testPredictPlot[look_back:len(prediction) + look_back, :] = prediction
plt.plot(testPredictPlot, color='red')
x = np.linspace(look_back, len(prediction) + look_back, len(y_test))
plt.plot(x, y_test)
plt.show()

The Keras LSTM implementation expects input of shape (Batch, Timesteps, Features).

One solution would be to set Timesteps = 1 and pass the sequence length as the Batch dimension.
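As a rough illustration of that shape convention, the NumPy-only sketch below reshapes univariate series of different lengths into (Batch, Timesteps, Features) with Timesteps = 1. The helper name `to_single_step_batch` and the sample lengths are made up for the example. It only shows the reshaping; with a single timestep per sample, a stateless LSTM would see each point in isolation.

```python
import numpy as np


def to_single_step_batch(series):
    """Reshape a 1-D series of length n to (n, 1, 1): n samples, 1 timestep, 1 feature."""
    series = np.asarray(series, dtype="float32")
    return series.reshape(-1, 1, 1)


# Two series of different lengths end up with the same per-sample shape (1, 1),
# so both can be fed to a layer built with input_shape=(1, 1).
short = to_single_step_batch(np.arange(5))
long = to_single_step_batch(np.arange(12))
print(short.shape, long.shape)
```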

If the sampling procedure is the same (no need for resampling), and the difference in length only comes from when the recording started (X+100 instead of X), I would try to get rid of the lag in the pre-processing stage and keep only the section of interest.
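A minimal sketch of that pre-processing step, under the assumption that all series share the same sampling rate and the same end point, so they differ only in how early recording started. The function name `align_to_common_window` and the toy data are hypothetical; the trade-off is that the extra history of the longer series is discarded.

```python
import numpy as np


def align_to_common_window(series_list):
    """Keep only the trailing samples that every series has in common.

    Assumes a shared end point and sampling rate, so the series differ
    only in when their recording started.
    """
    common_len = min(len(s) for s in series_list)
    trimmed = [np.asarray(s, dtype="float32")[-common_len:] for s in series_list]
    # Stack to (batch, timesteps, features), the layout Keras recurrent layers expect.
    return np.stack(trimmed)[..., np.newaxis]


started_day_x = np.arange(300)          # recording started on day X
started_day_x100 = np.arange(100, 300)  # recording started on day X+100
batch = align_to_common_window([started_day_x, started_day_x100])
print(batch.shape)
```

After trimming, every series has the same length, so the whole data set can be passed to a single fit call instead of one call per shape.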

Part 1: Plotting the irregular heartbeat. Part 2 will be setting up the LSTM network to classify incoming heartbeat voltage and predict irregular beat patterns. I think the rows can be fed into the LSTM as sequences of 10 data points at a time, with the number of rows as the batch size.

from scipy.io import arff
import pandas as pd
import matplotlib.pyplot as plt

data = arff.loadarff('ECG5000_TRAIN.arff')
df = pd.DataFrame(data[0])

# All signal columns, i.e. everything except the class label
columns = [x for x in df.columns if x != "target"]
print(columns)


def class_mean(label):
    """Mean beat (one value per time step) over all rows with the given class label."""
    return df[df["target"] == label][columns].mean(axis=0).to_numpy()


# loadarff returns nominal attributes as bytes, hence the b'...' labels
normal = class_mean(b'1')
rOnT = class_mean(b'2')
pvc = class_mean(b'3')
sp = class_mean(b'4')
ub = class_mean(b'5')

plt.plot(normal, label="Normal")
plt.plot(rOnT, label="R on T", alpha=.3)
plt.plot(pvc, label="PVC", alpha=.3)
plt.plot(sp, label="SP", alpha=.3)
plt.plot(ub, label="UB", alpha=.3)
plt.legend()
plt.title("ECG")
plt.show()
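The idea of feeding each row to the LSTM ten points at a time could be sketched as follows. This is one possible reading of that suggestion: each row is cut into non-overlapping windows of 10 points, giving an array of shape (rows, windows, 10). The helper `rows_to_windows` is hypothetical, and the random array with 140 points per row merely stands in for the real ECG rows.

```python
import numpy as np


def rows_to_windows(rows, window=10):
    """Cut each row into consecutive, non-overlapping windows of `window` points."""
    rows = np.asarray(rows, dtype="float32")
    n_rows, n_points = rows.shape
    n_windows = n_points // window
    # Drop any trailing points that do not fill a whole window.
    usable = rows[:, : n_windows * window]
    return usable.reshape(n_rows, n_windows, window)


# Synthetic stand-in data: 4 rows of 140 points each (an assumption for the demo).
fake_rows = np.random.default_rng(0).normal(size=(4, 140))
windows = rows_to_windows(fake_rows)
print(windows.shape)
```

Read as (Batch, Timesteps, Features), this gives one sample per row, with each timestep carrying 10 consecutive voltage values.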
