
Python multiprocessing for dataset preparation

I'm looking for shorter ways to prepare my dataset for a machine-learning task. I found that the multiprocessing library might be helpful. However, because I'm a newbie with multiprocessing, I couldn't find a proper way to use it.

I first wrote some code like the following:

import gc
from multiprocessing import Process, Queue
from tqdm import tqdm


class DatasetReader:
    def __init__(self):
        self.data_list = Read_Data_from_file
        self.num_data = len(self.data_list)  # used by the consumer's progress bar
        self.data = []

    def _ready_data(self, ex, idx):
        # Some complex functions that take several minutes
        pass

    def _dataset_creator(self, queue):
        for idx, ex in enumerate(self.data_list):
            queue.put(self._ready_data(ex, idx))

    def _dataset_consumer(self, queue):
        total_mem = 0.0
        t = tqdm(range(self.num_data), total=self.num_data, desc='Building Dataset ', bar_format='{desc}:{percentage:3.0f}% ({n_fmt}/{total_fmt}) [{elapsed}<{remaining},{rate_fmt}{postfix}]')
        for idx in t:
            ins = queue.get()
            self.data.append(ins)
            gc.collect()

    def _build_dataset(self):
        queue = Queue()
        creator = Process(target=self._dataset_creator, args=(queue,))
        consumer = Process(target=self._dataset_consumer, args=(queue,))
        creator.start()
        consumer.start()

        queue.close()
        queue.join_thread()

        creator.join()
        consumer.join()

However, in my opinion, because _dataset_creator processes the data (via _ready_data) serially, this would not help reduce the time consumption.

So, I modified the code to spawn multiple processes, each of which processes one datum:

class DatasetReader:
    def __init__(self):
        self.data_list = Read_Data_from_file
        self.num_data = len(self.data_list)  # used by the consumer's progress bar
        self.data = []

    def _ready_data(self, ex, idx):
        # Some complex functions that take several minutes
        pass

    def _dataset_creator(self, ex, idx, queue):
        queue.put(self._ready_data(ex, idx))

    def _dataset_consumer(self, queue):
        total_mem = 0.0
        t = tqdm(range(self.num_data), total=self.num_data, desc='Building Dataset ', bar_format='{desc}:{percentage:3.0f}% ({n_fmt}/{total_fmt}) [{elapsed}<{remaining},{rate_fmt}{postfix}]')
        for idx in t:
            ins = queue.get()
            self.data.append(ins)
            gc.collect()

    def _build_dataset(self):
        queue = Queue()
        for idx, ex in enumerate(self.data_list):
            p = Process(target=self._dataset_creator, args=(ex, idx, queue,))
            p.start()
        consumer = Process(target=self._dataset_consumer, args=(queue,))
        consumer.start()

        queue.close()
        queue.join_thread()

        consumer.join()

However, this gives me errors:

Process Process-18:  
Traceback ~~~  
RuntimeError: can't start new thread  
Traceback ~~~  
OSError: [Errno 12] Cannot allocate memory  

Could you help me process this complex data in parallel?

EDIT 1:

Thanks to @tdelaney, I can reduce the time consumption by generating self.num_worker processes (16 in my experiment):

    def _dataset_creator(self, pid, queue):
        for idx, ex in list(enumerate(self.data_list))[pid::self.num_worker]:
            queue.put(self._ready_data(ex, idx))

    def _dataset_consumer(self, queue):
        t = tqdm(range(self.num_data), total=self.num_data, desc='Building Dataset ', bar_format='{desc}:{percentage:3.0f}% ({n_fmt}/{total_fmt}) [{elapsed}<{remaining},{rate_fmt}{postfix}]')
        for _ in t:
            ins = queue.get()
            self.data[ins['idx']] = ins

    def _build_dataset(self):
        queue = Queue()
        procs = []
        for pid in range(self.num_worker):
            p = Process(target=self._dataset_creator, args=(pid, queue,))
            procs.append(p)
            p.start()
        consumer = Process(target=self._dataset_consumer, args=(queue,))
        consumer.start()

        queue.close()
        queue.join_thread()

        for p in procs:
            p.join()
        consumer.join()
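
For reference, this version assumes a couple of additions to __init__ that aren't shown above: self.num_worker and a pre-allocated self.data that the consumer can assign into by index. A minimal sketch of what I assume those additions look like:

    def __init__(self):
        self.data_list = Read_Data_from_file   # placeholder from the original code
        self.num_data = len(self.data_list)
        self.num_worker = 16                   # number of creator processes
        # pre-allocated so the consumer can write each result by ins['idx']
        self.data = [None] * self.num_data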

I'm trying to sketch out what a solution with a multiprocessing pool would look like. I got rid of the consumer process completely because it looks like the parent process is just waiting anyway (and needs the data eventually) so it can be the consumer. So, I set up a pool and use imap_unordered to handle passing the data to the worker.
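
For context, imap_unordered hands work items to the pool and yields results in completion order rather than submission order, which is why each result needs to carry its index. A tiny standalone sketch of my own (not part of the answer's code) showing that behaviour:

from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == '__main__':
    with Pool(4) as pool:
        # results arrive as workers finish, possibly out of order
        for result in pool.imap_unordered(square, range(10)):
            print(result)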

I guessed that the data processing doesn't really need the DatasetReader at all and moved it out to its own function. On Windows, either the entire DatasetReader object is serialized to the subprocess (including data you don't want) or the child's version of the object is incomplete and may crash when you try to use it.

Either way, changes made to a DatasetReader object in the child processes aren't seen in the parent. This can be unexpected if the parent depends on updated state in that object. In my opinion, it's best to strictly limit what happens in the subprocesses.
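
As a quick illustration of that point (a minimal sketch of my own, not part of the original code), mutations made to an object inside a child process only affect that process's copy:

from multiprocessing import Process

class Holder:
    def __init__(self):
        self.items = []

def worker(holder):
    # this appends to the child's copy of the object only
    holder.items.append('added in child')

if __name__ == '__main__':
    h = Holder()
    p = Process(target=worker, args=(h,))
    p.start()
    p.join()
    print(h.items)   # prints [] -- the parent's copy is unchanged

With that caveat in mind, here is the pool-based sketch: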

from multiprocessing import Pool, get_start_method, cpu_count
from tqdm import tqdm

# moved out of class (assuming it is not class dependent) so that
# the entire DatasetReader object isn't pickled and sent to
# the child on spawning systems like Microsoft Windows

def _ready_data(idx_ex):
    idx, ex = idx_ex
    # Some complex functions that take several minutes
    result = complex_functions(ex)
    return (idx, result)


class DatasetReader:

    def __init__(self):
        self.data_list = Read_Data_from_file
        self.data = [None] * len(self.data_list)
        self.num_data = len(self.data_list)

    def _ready_data_fork(self, idx):
        # on forking system, call worker with object data
        return _ready_data((idx, self.data_list[idx]))

    def run(self):

        # progress bar advanced manually as results come back from the pool
        t = tqdm(total=self.num_data, desc='Building Dataset ',
            bar_format='{desc}:{percentage:3.0f}% ({n_fmt}/{total_fmt}) '
                '[{elapsed}<{remaining},{rate_fmt}{postfix}]')

        pool = Pool(min(cpu_count(), len(self.data_list)))
        if get_start_method() == 'fork':
            # on a forking system, self.data_list already exists in the child
            # process, so we only pass the index
            result_iter = pool.imap_unordered(self._ready_data_fork,
                    range(len(self.data_list)),
                    chunksize=1)
        else:
            # on a spawning system, we need to pass the data itself
            result_iter = pool.imap_unordered(_ready_data,
                    enumerate(self.data_list),
                    chunksize=1)

        for idx, result in result_iter:
            t.update(1)
            self.data[idx] = result
        t.close()

        pool.close()
        pool.join()
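
A hypothetical usage sketch (Read_Data_from_file and complex_functions are placeholders carried over from the code above):

if __name__ == '__main__':
    reader = DatasetReader()
    reader.run()
    print(len(reader.data), 'examples prepared')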
