
How to create input samples from a pandas DataFrame for an LSTM model?

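I'm trying to create an LSTM model that gives me a binary output: buy or don't buy. My data has the format [date_time, close, volume], with millions of rows. I'm stuck on formatting the data as 3-D: samples, timesteps, features.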

I have used pandas to read the data. I want to format it so that I get 4000 samples, each with 400 timesteps and two features (close and volume). Can someone advise on how to do this?
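One way to picture the target shape is to cut the rows into consecutive, non-overlapping windows with plain NumPy. This is only a sketch under assumptions: the helper name `to_windows` and the synthetic data are hypothetical, and non-overlapping windows are just one possible choice (overlapping sliding windows are another).

```python
import numpy as np
import pandas as pd


def to_windows(df, feature_cols, time_steps):
    """Cut the rows of df into consecutive, non-overlapping windows.

    Returns an array of shape (samples, time_steps, features). Trailing rows
    that do not fill a whole window are dropped.
    """
    values = df[feature_cols].to_numpy(dtype="float32")
    n_samples = len(values) // time_steps
    usable = values[: n_samples * time_steps]
    return usable.reshape(n_samples, time_steps, len(feature_cols))


# Synthetic stand-in for the real data (hypothetical values).
rng = np.random.default_rng(0)
n_rows = 4000 * 400 + 123  # a few extra rows that will be dropped
df = pd.DataFrame({
    "date_time": pd.date_range("2020-01-01", periods=n_rows, freq="min"),
    "close": rng.random(n_rows),
    "volume": rng.random(n_rows),
})

X = to_windows(df, ["close", "volume"], time_steps=400)
print(X.shape)  # (4000, 400, 2)
```

The date_time column is left out of the features here; it only orders the rows.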

EDIT: I am now using the TimeseriesGenerator as suggested, but I'm not sure how to inspect my sequences or how to replace the output Y with my own binary buy output.

df = normalize_data(df)

print("Creating sequences for NN \n")
targets = df.drop('date_time', axis=1)
train = keras.preprocessing.sequence.TimeseriesGenerator(df, targets, 1, sampling_rate=1, stride=1,
                                                         start_index=0, end_index=int(len(df.index)*0.8),
                                                         shuffle=True, reverse=False, batch_size=time_steps)

This runs without errors, but now the output is the first close value after the input time series.

EDIT 2: So far my code looks like this:

df = data.normalize_data(df)
targets = df.iloc[:, 3]  # Buy signal target

df.drop('y1', axis=1, inplace=True)
df.drop('y2', axis=1, inplace=True)

train = TimeseriesGenerator(df, targets, length=1, sampling_rate=1, stride=1,
                            start_index=0, end_index=int(len(df.index) * 0.8),
                            shuffle=True, reverse=False, batch_size=time_steps)

# number of samples
print("Samples: " + str(len(train)))
x, y = train[0]
print(str(x))

The output is as follows:

Samples: 8
Traceback (most recent call last):
File "/home/stian/.local/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 3078, in get_loc
return self._engine.get_loc(key)
File "pandas/_libs/index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: range(418, 419)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "./main.py", line 94, in <module>
data_menu()
File "./main.py", line 42, in data_menu
data_menu()
File "./main.py", line 56, in data_menu
nn_menu()
File "./main.py", line 76, in nn_menu
nn.nn_gen(pre_processed_data)
File "/home/stian/git/stian9k/nn.py", line 33, in nn_gen
x, y = train[0]
File "/home/stian/.local/lib/python3.6/site-packages/keras_preprocessing/sequence.py", line 378, in __getitem__
samples[j] = self.data[indices]
File "/home/stian/.local/lib/python3.6/site-packages/pandas/core/frame.py", line 2688, in __getitem__
return self._getitem_column(key)
File "/home/stian/.local/lib/python3.6/site-packages/pandas/core/frame.py", line 2695, in _getitem_column
return self._get_item_cache(key)
File "/home/stian/.local/lib/python3.6/site-packages/pandas/core/generic.py", line 2489, in _get_item_cache
values = self._data.get(item)
File "/home/stian/.local/lib/python3.6/site-packages/pandas/core/internals.py", line 4115, in get
loc = self.items.get_loc(item)
File "/home/stian/.local/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 3080, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas/_libs/index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: range(418, 419)

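So it seems that even though I get 8 objects from the generator, I am not able to look them up. If I test the type with print(str(type(train))), I get a TimeseriesGenerator object. Again, any advice is much appreciated.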

EDIT 3: It turns out that TimeseriesGenerator doesn't like pandas DataFrames. The problem was solved by converting the data to a numpy array and converting the pandas timestamp type to float.
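A minimal sketch of that conversion, assuming the column names from the question and a tiny made-up DataFrame. The traceback above shows the generator indexing its data with a range (`self.data[indices]`). A DataFrame treats that as a column lookup, while a numpy array treats it as row positions.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame with the columns named in the question.
df = pd.DataFrame({
    "date_time": pd.to_datetime([
        "2020-01-01 00:00:00",
        "2020-01-01 00:01:00",
        "2020-01-01 00:02:00",
    ]),
    "close": [10.0, 10.5, 10.2],
    "volume": [100, 150, 120],
})

numeric = df.copy()
# Timestamps become float seconds since the Unix epoch.
numeric["date_time"] = (
    numeric["date_time"] - pd.Timestamp("1970-01-01")
) / pd.Timedelta(seconds=1)

# A plain float array: one row per timestep, one column per feature.
data = numeric.to_numpy(dtype="float64")

# Positional row indexing with a range works on the array.
row = data[range(1, 2)]
print(data.shape, row.shape)  # (3, 3) (1, 3)
```

Epoch seconds are large numbers, so in practice the timestamp column would probably need scaling as well, or could be dropped from the features altogether.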

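You can simply use the Keras TimeseriesGenerator for this. It lets you easily set the length (i.e. the number of timesteps in each sample), the sampling rate, and the stride used to sub-sample the data.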

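It returns an instance of the Sequence class, which you can then pass to fit_generator to fit the model on the data it generates. I highly recommend reading the documentation for more information on this class, its arguments, and its usage.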
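To make the roles of length, sampling rate and stride concrete, here is a small NumPy sketch of the same windowing idea. It is a stand-in written for illustration, not the Keras class itself, and the name `make_windows` is hypothetical. As far as I understand the generator, the window that ends just before row i is paired with targets[i], and batch_size only groups samples into batches. With length=1 and batch_size=time_steps, as in the question, each sample would therefore hold a single timestep.

```python
import numpy as np


def make_windows(data, targets, length, sampling_rate=1, stride=1):
    """Sliding windows over data (a NumPy sketch, not the Keras class).

    For each row i = length, length + stride, ... the sample is
    data[i - length:i:sampling_rate] and the target is targets[i].
    """
    rows = range(length, len(data), stride)
    X = np.stack([data[i - length:i:sampling_rate] for i in rows])
    y = np.array([targets[i] for i in rows])
    return X, y


data = np.arange(20, dtype="float32").reshape(10, 2)  # 10 timesteps, 2 features
targets = np.arange(10)

X, y = make_windows(data, targets, length=4)
print(X.shape, y.shape)  # (6, 4, 2) (6,)
```

Printing X[0] and y[0] on a toy array like this is a quick way to check which rows end up in a sample and which target they are paired with.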

Thanks. I was getting a lot of crazy numbers from the DataFrame. Converting it with to_numpy() before using it solved the problem!

input_convertido = df.to_numpy()
output_convertido = df["close"].to_numpy()
gerador = TimeseriesGenerator(input_convertido, output_convertido, length=n_input, batch_size=1, sampling_rate=1)
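The snippet above uses the close price itself as the target. For a binary buy / don't buy output, the targets array could hold 0/1 labels instead. The sketch below assumes a purely hypothetical rule (buy when the close one step ahead is higher than the current close); `buy_labels` and `output_binario` are made-up names, and the real buy signal would come from whatever rule the strategy defines.

```python
import numpy as np


def buy_labels(close, horizon=1):
    """Return 1 where the close `horizon` steps ahead is higher, else 0.

    Hypothetical labelling rule. `horizon` must be at least 1. The last
    `horizon` entries have no future value and are set to 0.
    """
    close = np.asarray(close, dtype="float64")
    labels = np.zeros(len(close), dtype="int64")
    labels[:-horizon] = (close[horizon:] > close[:-horizon]).astype("int64")
    return labels


close = np.array([10.0, 10.5, 10.2, 10.2, 11.0])
output_binario = buy_labels(close)
print(output_binario)  # [1 0 0 1 0]
```

An array like output_binario could then be passed as the second argument in place of output_convertido. Which label belongs at which index depends on how the generator aligns targets with windows, so it is worth checking the alignment on a small array first.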
