Running a model on multiple cores with different datasets in Python

Question

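I have a folder containing several datasets. I want to run a model on each of them and spread the load across multiple cores, in the hope of reducing the overall run time of the data processing.

My computer has 8 cores. Below is my first attempt. It is really just a sketch, but htop shows that the job only uses one core. I'm new to multi-core programming.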

import pandas as pd
import multiprocessing
import os
from library_example import model_example

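def worker(file_):
    # Read the raw file contents (unused below: the model is given the path)
    with open(file_, 'r') as f_open:
        data = f_open.read()

    # Run model
    model_results = model_example(file_)

    # Save results as CSV next to the input file
    # (assumes model_results is something pd.Series can wrap)
    pd.Series(model_results).to_csv(file_[:-4] + "_results.csv")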

file_location_ = "/home/datafiles/"
if __name__ == '__main__':
    for filename in os.listdir(file_location_):
        p = multiprocessing.Process(target=worker, args=(file_location_ + filename,))
        p.start()
        p.join()

Answer 1

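Try moving p.join() out of the loop. join() waits for the process to finish, so starting a process (start) and then immediately waiting for it (join) on every iteration effectively makes the whole thing serial. Instead, you could try something like this: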

# construct the workers
workers = [multiprocessing.Process(target=worker, args=(file_location_ + filename,)) for filename in os.listdir(file_location_)]

# start them
for proc in workers:
    proc.start()

# now we wait for them
for proc in workers:
    proc.join()

(I haven't tried running this against your code, but something like it should work.)
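To see the difference between the two join() placements, here is a small timing sketch. The worker here is only a stand-in that sleeps (it is not your model), and run_serial / run_parallel are names I made up for the illustration.

```python
import multiprocessing
import time


def worker(seconds):
    # Stand-in for the real model run: just sleep for a while.
    time.sleep(seconds)


def run_serial(durations):
    # start() immediately followed by join(): only one process runs at a time.
    started = time.perf_counter()
    for seconds in durations:
        proc = multiprocessing.Process(target=worker, args=(seconds,))
        proc.start()
        proc.join()
    return time.perf_counter() - started


def run_parallel(durations):
    # Start every process first, then wait for all of them.
    started = time.perf_counter()
    procs = [multiprocessing.Process(target=worker, args=(seconds,))
             for seconds in durations]
    for proc in procs:
        proc.start()
    for proc in procs:
        proc.join()
    return time.perf_counter() - started


if __name__ == "__main__":
    durations = [0.2] * 4
    print(f"join inside the loop: {run_serial(durations):.2f}s")
    print(f"join after the loop:  {run_parallel(durations):.2f}s")
```

With enough free cores, the second timing should come out close to the longest single task rather than the sum of all of them. Note that this version starts one process per file, however many files there are.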

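Edit: If you want to limit the number of workers/processes, I'd suggest simply using a Pool. You can specify how many processes to use and then map(..) the arguments onto those processes. Example: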

# construct a pool of workers
pool = multiprocessing.Pool(6)
pool.map(worker, [file_location_ + filename for filename in os.listdir(file_location_)])
pool.close()
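For completeness, here is a fuller, self-contained sketch of the Pool approach. It makes a few assumptions: model_stub is a hypothetical stand-in for model_example (it only counts lines and characters), and result_path, collect_inputs and run_all are helper names I introduced, not part of any library. The Pool is created under the `if __name__ == "__main__":` guard, which multiprocessing needs on platforms that start workers by spawning a fresh interpreter.

```python
import multiprocessing
import os
import sys


def model_stub(text):
    # Hypothetical stand-in for model_example: counts lines and characters.
    return {"lines": len(text.splitlines()), "chars": len(text)}


def result_path(path):
    # "/data/a.csv" -> "/data/a_results.csv"
    root, _ext = os.path.splitext(path)
    return root + "_results.csv"


def collect_inputs(folder):
    # Full paths of regular files in folder, skipping earlier result files.
    if not os.path.isdir(folder):
        return []
    paths = (os.path.join(folder, name) for name in sorted(os.listdir(folder)))
    return [p for p in paths
            if os.path.isfile(p) and not p.endswith("_results.csv")]


def worker(path):
    # Runs in a worker process: read one dataset, run the model, save a CSV.
    with open(path, "r") as f_in:
        results = model_stub(f_in.read())
    out = result_path(path)
    with open(out, "w") as f_out:
        f_out.write("metric,value\n")
        for key, value in results.items():
            f_out.write(f"{key},{value}\n")
    return out


def run_all(folder, processes=None):
    inputs = collect_inputs(folder)
    if not inputs:
        return []
    # processes=None lets Pool default to os.cpu_count() workers.
    with multiprocessing.Pool(processes=processes) as pool:
        # map() blocks until every file has been processed.
        return pool.map(worker, inputs)


if __name__ == "__main__":
    folder = sys.argv[1] if len(sys.argv) > 1 else "/home/datafiles/"
    for written in run_all(folder):
        print("wrote", written)
```

Having each worker write its own output file means only a short path string travels back to the parent process, which keeps the inter-process traffic small. Pass processes=6 to run_all to leave a couple of cores free, as in the snippet above.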

Solution 1
2 Accepted 2017-11-07 13:57:59
