
How to apply multiprocessing to a sliding window

I am creating 1 million values and splitting them into train and test sets using a sliding window that is 1000 values long and slides forward by one value each time.

For example, the first step would put the first 1000 values into the train set and the 1001st value into the test set. The second step would use values 2 to 1001 for train and the 1002nd value for test, and so on.
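For a toy list, this is what the intended splits look like:

data = [10, 20, 30, 40, 50]
window_size = 3
for i in range(window_size, len(data)):
    print('TRAIN:', data[i - window_size:i], 'TEST:', data[i])
# TRAIN: [10, 20, 30] TEST: 40
# TRAIN: [20, 30, 40] TEST: 50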

It takes 76.28 seconds to run the script; I measured this with timeit.

Now, I want to reduce this time by running the sliding window on multiple processors. I used Pool from multiprocessing with 4 CPUs, but it didn't change the performance at all. I am wondering what a better approach would be.

Code:

from multiprocessing import Process
from multiprocessing import Pool
import numpy as np
import pandas as pd
from timeit import default_timer as timer

start = timer()

data = list(range(1_000_000))
window_size = 1_000
splits = []

def sw(window_size, data):
    for i in range(window_size, len(data)):
        train = np.array(data[i - window_size:i])
        test = np.array(data[i:i + 1])
        splits.append(('TRAIN:', train, 'TEST:', test))

#  sw(window_size, data)
#  print(splits)

if __name__ == '__main__':
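    # note: p is bound to a Pool here, then immediately rebound to a single
    # Process below, so the Pool of 4 workers is never actually used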
    p = Pool(4)
    p = Process(target=sw, args=(window_size, data))
    p.start()
    p.join()

end = timer()
print(end - start)

Indeed, as the comments point out, all you do is create a Pool named p and then reassign that variable to a Process object, so the pool is never used. I rewrote your sliding window function a little. A simple way to parallelize independent tasks is to specify what you want to do to a single item and then just use the map functor. Benchmarks were performed on an Intel Core i5-6300U @ 2.40 GHz (dual core with hyperthreading).

from multiprocessing import Pool
import numpy as np
from timeit import default_timer as timer

NUM_EL = 1_000_000
WINDOW_SIZE = 1000
DATA = list(range(NUM_EL))


def window(start_idx, window_size=WINDOW_SIZE, data=DATA):
    # train on the window_size values starting at start_idx, test on the
    # value right after the window (mirroring the question's split)
    _train = np.array(data[start_idx:start_idx + window_size])
    _test = np.array(data[start_idx + window_size])
    # return something useful here
    return start_idx


if __name__ == '__main__':
    STARTS = list(range(NUM_EL - WINDOW_SIZE))  # one start index per possible split

    start = timer()
    result_single = list(map(window, STARTS))
    end = timer()
    print("Single core: ", end - start)

    start = timer()
    with Pool(4) as p:
        result_multi = p.map(window, STARTS)

    end = timer()
    print(result_single == result_multi)
    print("Multiprocessing: ", end - start)
>>> Single core:  99.9821742
>>> Multiprocessing:  38.71327739999998
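
The window function above only returns the start index as a placeholder. If you need the splits themselves, a minimal sketch (window_arrays is a hypothetical name, not part of the benchmark) could return the arrays instead; keep in mind that every return value is pickled and shipped back to the parent process, so heavier results eat into the multiprocessing speedup:

def window_arrays(start_idx, window_size=WINDOW_SIZE, data=DATA):
    # Same split as window(), but returns the arrays instead of the index.
    # Each (train, test) pair is pickled back to the parent process, which
    # adds inter-process communication cost proportional to the window size.
    _train = np.array(data[start_idx:start_idx + window_size])
    _test = np.array(data[start_idx + window_size:start_idx + window_size + 1])
    return _train, _test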

Note: This code most likely does NOT work in any environment using IPython, because the spawned worker processes must be able to import the window function, and functions defined interactively in IPython cannot be pickled for the workers.
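If you do need to run this from IPython or a notebook, the usual workaround is to put the worker in an importable module, since worker processes pickle the function by reference and must be able to import it. A minimal sketch, assuming a hypothetical file name of worker.py:

# worker.py
import numpy as np

NUM_EL = 1_000_000
WINDOW_SIZE = 1000
DATA = list(range(NUM_EL))

def window(start_idx, window_size=WINDOW_SIZE, data=DATA):
    _train = np.array(data[start_idx:start_idx + window_size])
    _test = np.array(data[start_idx + window_size])
    return start_idx

and then, in the interactive session:

from multiprocessing import Pool
from worker import window, NUM_EL, WINDOW_SIZE

with Pool(4) as p:
    result = p.map(window, range(NUM_EL - WINDOW_SIZE))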
