
More efficient way to build a dataset than using lists

I am building a dataset for a sequence-to-point conv network, where each window is moved by one timestep. Basically, this loop does it:

    x_train = []
    y_train = []


    for i in range(window,len(input_train)):
        x_train.append(input_train[i-window:i].tolist())
        y = target_train[i-window:i]
        y = y[int(len(y)/2)]
        y_train.append(y)

When I use a large value for `window`, e.g. 500, I get a memory error. Is there a way to build the training dataset more efficiently?

You should use pandas. It still might take too much space, but you can try:

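    import pandas as pd

    # if input_train isn't a pd.Series already
    input_train = pd.Series(input_train)

    # give every window a fresh 0-based index so its values line up in columns
    rolling_data = (w.reset_index(drop=True) for w in input_train.rolling(window))
    # the first window - 1 windows are incomplete, so drop them
    x_train = pd.DataFrame(rolling_data).iloc[window - 1:]
    # one target per window, taken from the middle of that window
    y_train = target_train[window // 2:window // 2 + len(x_train)]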
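To get a feel for why the result "still might take too much space", a rough back-of-the-envelope estimate can help. The sketch below is only an approximation: `estimate_bytes` is a hypothetical helper, the series length of one million samples is an assumed figure (the question does not state it), and the per-value cost of nested lists assumes 64-bit CPython, where each list slot is an 8-byte pointer to a separate float object.

```python
import sys


def estimate_bytes(n_samples, window, as_lists):
    """Rough size of all windows for a series of n_samples values."""
    n_windows = n_samples - window      # same count as the loop in the question
    n_values = n_windows * window       # every window stores `window` values
    if as_lists:
        # assumption: 8-byte list slot plus one Python float object per value
        per_value = 8 + sys.getsizeof(1.0)
    else:
        per_value = 8                   # one float64 in a dense array
    return n_values * per_value


n_samples = 1_000_000  # assumed length, not taken from the question
window = 500
for label, as_lists in [("nested lists", True), ("float64 array", False)]:
    gib = estimate_bytes(n_samples, window, as_lists) / 2**30
    print(f"{label}: roughly {gib:.1f} GiB")
```

Either way the total grows with the number of samples times the window size, so a dense copy of every window gets expensive quickly once `window` is in the hundreds.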

Here is an explanation with an example.

Assume a simple series:

    >>> input_train = pd.Series([1, 2, 3, 4, 5])
    >>> input_train
    0    1
    1    2
    2    3
    3    4
    4    5
    dtype: int64

We can create a DataFrame with the windowed data like so:

    >>> pd.DataFrame(input_train.rolling(2))
         0    1    2    3    4
    0  1.0  NaN  NaN  NaN  NaN
    1  1.0  2.0  NaN  NaN  NaN
    2  NaN  2.0  3.0  NaN  NaN
    3  NaN  NaN  3.0  4.0  NaN
    4  NaN  NaN  NaN  4.0  5.0
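To see where this staircase pattern comes from, it can help to print each window's index labels next to its values. This is a minimal sketch, assuming a pandas version in which rolling objects can be iterated (1.1 or later, as far as I know); the name `windows` is just for illustration.

```python
import pandas as pd

input_train = pd.Series([1, 2, 3, 4, 5])

# Collect (index labels, values) for every window that rolling(2) yields.
windows = [(w.index.tolist(), w.tolist()) for w in input_train.rolling(2)]

for labels, values in windows:
    # Each window is a slice of the original series, so it keeps the
    # original labels rather than starting again at 0.
    print("labels:", labels, "values:", values)
```

Because the DataFrame constructor aligns each row on those labels, every window lands in the columns named after its original positions.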

The problem is that the values in each window keep their original index labels (the value at position 0 has label 0, the one at position 1 has label 1, and so on), so they end up in the corresponding columns. We can fix this by resetting the index of each window:

    >>> pd.DataFrame(w.reset_index(drop=True) for w in input_train.rolling(2))
         0    1
    0  1.0  NaN
    1  1.0  2.0
    2  2.0  3.0
    3  3.0  4.0
    4  4.0  5.0

The only thing left to do is remove the first `window - 1` rows, because those windows are incomplete (that is just how `rolling` works):

    >>> pd.DataFrame(w.reset_index(drop=True) for w in input_train.rolling(2)).iloc[2 - 1:]  # same as .iloc[1:]
         0    1
    1  1.0  2.0
    2  2.0  3.0
    3  3.0  4.0
    4  4.0  5.0
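If the DataFrame is still too large, one alternative worth trying is NumPy's `sliding_window_view`, which exposes the windows as a read-only view on the original array instead of copying them. This is a different technique from the pandas approach above, not part of it. The sketch assumes NumPy 1.20 or later and 1-D inputs; `build_windows` is a hypothetical helper name, and it drops the final window only to reproduce the loop in the question, which stops one step before the end.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def build_windows(inputs, targets, window):
    """Return (x, y) equal to the question's loop, without copying inputs."""
    inputs = np.asarray(inputs)
    targets = np.asarray(targets)
    # Every length-`window` slice of inputs, as a view (no data is copied).
    # The loop in the question never reaches the last window, so drop it.
    x = sliding_window_view(inputs, window)[:-1]
    # The target for each window is the value at the middle of that window.
    y = targets[window // 2:window // 2 + len(x)]
    return x, y


# Small check against the original loop, using made-up data.
inputs = np.arange(10.0)
targets = np.arange(10.0) * 10
window = 4
x, y = build_windows(inputs, targets, window)

x_loop = [inputs[i - window:i].tolist() for i in range(window, len(inputs))]
y_loop = [targets[i - window:i][window // 2] for i in range(window, len(targets))]
print(x.tolist() == x_loop, y.tolist() == y_loop)
```

The view only stays cheap as long as nothing forces a copy (for example `astype`, or anything that needs a contiguous array), so it is probably safest to slice it into batches and convert one batch at a time.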
