More efficient way to build a dataset than using lists
I am building a dataset for a sequence-to-point conv network, where each window is moved by one timestep. Basically, this loop is doing it:
x_train = []
y_train = []
for i in range(window, len(input_train)):
    x_train.append(input_train[i-window:i].tolist())
    y = target_train[i-window:i]
    y = y[int(len(y)/2)]
    y_train.append(y)
When I use a large value for window, e.g. 500, I get a memory error. Is there a way to build the training dataset more efficiently?
You should use pandas. It still might take too much space, but you can try:
import pandas as pd
# if input_train isn't a pd.Series already
input_train = pd.Series(input_train)
rolling_data = (w.reset_index(drop=True) for w in input_train.rolling(window))
x_train = pd.DataFrame(rolling_data).iloc[window - 1:]
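# window k covers input_train[k:k + window], so its middle target sits at position k + window // 2
y_train = target_train[window // 2 : window // 2 + len(x_train)]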
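A quick way to check that the windows and their middle targets line up is to run the same steps on toy data and compare against a plain loop. This is only a sketch: the values of window, input_train and target_train are assumptions made for illustration, and the reference loop runs one step further than the loop in the question so that it also includes the final window, which rolling produces.

```python
import numpy as np
import pandas as pd

window = 4
# Toy data (assumed for illustration): targets are simply the inputs times 10.
input_train = pd.Series(np.arange(10))
target_train = input_train.to_numpy() * 10

# Rolling windows with per-window index reset; drop the incomplete leading windows.
rolling_data = (w.reset_index(drop=True) for w in input_train.rolling(window))
x_train = pd.DataFrame(rolling_data).iloc[window - 1:]

# Window k covers input_train[k:k + window]; its middle sample is at k + window // 2.
y_train = target_train[window // 2 : window // 2 + len(x_train)]

# Reference: plain loop over every complete window, including the last one.
x_ref, y_ref = [], []
for i in range(window, len(input_train) + 1):
    x_ref.append(input_train.iloc[i - window:i].tolist())
    y_ref.append(target_train[i - window:i][window // 2])

print(np.array_equal(x_train.to_numpy(), np.array(x_ref)))  # windows match
print(list(y_train) == y_ref)                               # targets match
```

Note that x_train comes out as floats because of the NaN padding in the incomplete windows, even when the input is integer.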
Some explanation with an example:
Assuming a simple series:
>>> input_train = pd.Series([1, 2, 3, 4, 5])
>>> input_train
0    1
1    2
2    3
3    4
4    5
dtype: int64
We can create a dataframe with the windowed data like so:
>>> pd.DataFrame(input_train.rolling(2))
     0    1    2    3    4
0  1.0  NaN  NaN  NaN  NaN
1  1.0  2.0  NaN  NaN  NaN
2  NaN  2.0  3.0  NaN  NaN
3  NaN  NaN  3.0  4.0  NaN
4  NaN  NaN  NaN  4.0  5.0
The problem with this is that the values in each window keep their original index labels (the value at index 0 keeps label 0, the one at index 1 keeps label 1, etc.), so they end up in the corresponding columns. We can fix this by resetting the index of each window:
>>> pd.DataFrame(w.reset_index(drop=True) for w in input_train.rolling(2))
     0    1
0  1.0  NaN
1  1.0  2.0
2  2.0  3.0
3  3.0  4.0
4  4.0  5.0
The only thing left to do is remove the first window - 1 rows, because they are incomplete (that is just how rolling works):
>>> pd.DataFrame(w.reset_index(drop=True) for w in input_train.rolling(2)).iloc[2-1:] # .iloc[1:]
     0    1
1  1.0  2.0
2  2.0  3.0
3  3.0  4.0
4  4.0  5.0
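If the DataFrame is still too large, one alternative (a different technique from the pandas approach above, shown only as a sketch) is NumPy's sliding_window_view, available in NumPy 1.20 and later. It returns a read-only view onto the original array, so building the windows does not copy the data. The toy values of window, input_train and target_train below are assumptions for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

window = 4
# Toy data (assumed for illustration).
input_train = np.arange(10, dtype=np.float32)
target_train = input_train * 10

# Read-only view of shape (len(input_train) - window + 1, window); no data is copied.
x_train = sliding_window_view(input_train, window)

# Window k covers input_train[k:k + window]; its middle sample is at k + window // 2.
y_train = target_train[window // 2 : window // 2 + len(x_train)]

print(x_train.shape)                           # (7, 4)
print(np.shares_memory(x_train, input_train))  # True
```

The saving only holds while x_train stays a view. Anything that copies it (for example np.array(x_train), or a framework that converts it to its own tensor type) will materialise the full windows-by-window array, so feeding it in batches may still be necessary.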