
Running models on multiple cores with different data sets in Python

I have a folder containing multiple datasets. I want to run a model over each of them and distribute the load across multiple cores, hopefully to reduce the overall run time of the data processing.

My computer has 8 cores. My first attempt is below. It's only really a sketch, but using htop I can see that only 1 core is being used for this job. Multi-core newbie here.

import pandas as pd
import multiprocessing
import os
from library_example import model_example

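def worker(file_):
    # Read the raw file contents (unused below; model_example receives the path)
    with open(file_, 'r') as f_open:
        data = f_open.read()

    # Run model
    model_results = model_example(file_)

    # Save results as a CSV next to the input file
    # (assumes model_results is something pd.Series can wrap, e.g. a list or dict)
    to_save = pd.Series(model_results)
    to_save.to_csv(file_[:-4] + "_results.csv")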

file_location_ = "/home/datafiles/"
if __name__ == '__main__':
    for filename in os.listdir(file_location_):
        p = multiprocessing.Process(target=worker, args=(file_location_ + filename,))
        p.start()
        p.join()

Try moving the p.join() out of the loop. join() waits for the process to complete, so starting a process (i.e. start) and then immediately waiting for it (i.e. join) makes the whole thing run serially. Instead you can try something like this:

# construct the workers
workers = [multiprocessing.Process(target=worker, args=(file_location_ + filename,)) for filename in os.listdir(file_location_)]

# start them
for proc in workers:
    proc.start()

# now we wait for them
for proc in workers:
    proc.join()

(I didn't try running this in your code but something like that should work.)
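
For anyone who wants to try the start-everything-then-join pattern without the real model, here is a minimal self-contained sketch. The worker below is a hypothetical stand-in (it just writes the input file's size to a ".out" file), and the temporary folder replaces /home/datafiles/; neither comes from the original code. Note that this still launches one process per file, so with many files you will have more processes than cores.

```python
import multiprocessing
import os
import tempfile


def worker(path):
    """Hypothetical stand-in for the real model: record the input's size."""
    with open(path + ".out", "w") as out:
        out.write(str(os.path.getsize(path)))


def run_all(paths):
    """Start one process per path, then wait for all of them."""
    procs = [multiprocessing.Process(target=worker, args=(p,)) for p in paths]
    for proc in procs:
        proc.start()  # returns immediately; the child runs in the background
    for proc in procs:
        proc.join()   # block only after every child has been started
    # exitcode is 0 for a child that finished without an uncaught exception
    return [proc.exitcode for proc in procs]


if __name__ == "__main__":
    # Stand-in data: a few small temporary files.
    with tempfile.TemporaryDirectory() as folder:
        for i in range(4):
            with open(os.path.join(folder, f"data_{i}.txt"), "w") as f:
                f.write("x" * (i + 1))
        # List the inputs before the workers create their ".out" files.
        paths = [os.path.join(folder, name) for name in sorted(os.listdir(folder))]
        print(run_all(paths))
```

Checking the exit codes is optional, but it is a cheap way to notice that one of the children crashed.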

EDIT: If you want to limit the number of workers/processes, I'd recommend just using a Pool. You can specify how many processes to use and then map(..) the arguments onto those processes. Example:

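# construct a pool of 6 worker processes
pool = multiprocessing.Pool(6)
# map() blocks until every file has been processed
pool.map(worker, [file_location_ + filename for filename in os.listdir(file_location_)])
# no more tasks will be submitted; wait for the worker processes to exit
pool.close()
pool.join()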
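
A runnable version of the Pool idea, as a rough sketch rather than a drop-in replacement: run_in_pool is a helper name I made up, os.path.getsize stands in for worker(), and the temporary folder stands in for /home/datafiles/. It uses the Pool as a context manager, which cleans up the worker processes once map() has returned.

```python
import multiprocessing
import os
import tempfile


def run_in_pool(func, paths, processes=6):
    """Apply func to every path using at most `processes` worker processes."""
    # map() blocks until all results are back, so it is safe for the
    # context manager to tear the pool down when the block exits.
    with multiprocessing.Pool(processes=processes) as pool:
        return pool.map(func, paths)


if __name__ == "__main__":
    # Stand-in data: three small temporary files.
    with tempfile.TemporaryDirectory() as folder:
        for i in range(3):
            with open(os.path.join(folder, f"data_{i}.txt"), "w") as f:
                f.write("x" * (i + 1))
        paths = [os.path.join(folder, name) for name in sorted(os.listdir(folder))]
        # os.path.getsize stands in for worker(); results come back in input order.
        print(run_in_pool(os.path.getsize, paths, processes=2))
```

Unlike the one-process-per-file version, the number of processes here stays fixed no matter how many files are in the folder, and map() returns whatever the function returns, in the same order as the inputs.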


 