
python multiprocessing with shared output

I am trying to process a large number of text files and calculate data within them (simple addition). The problem is that this takes a long time, and I know that there are multiprocessing facilities in other languages, but I have never done anything like this in Python.

Let's say I have a directory with 16,000 files. Currently, I open each file individually, bring it into an array in Python, do some manipulation of the data, and then output to a master array (with length 16,000). Can multiprocessing be used to run several instances of 'open the file, process the data, and output the result' into the same array?

The original code is basically like this:

import os
import numpy as np

# path to the directory of text files
filepath = '/path/to/file'

# Get the dir contents
filedir = os.listdir(filepath)

# Pre-allocate large array
large_array = np.zeros(len(filedir))

# Begin loop
for i in range(len(filedir)):
    # Define the path to load the text file
    filename = filepath + '/' + filedir[i]

    # process one file and store its result
    output = function_to_process_filename(filename)

    large_array[i] = output

Where would the multiprocessing / parallel portion go to potentially make the code run faster, and what would it look like in Python?

You can use a multiprocessing Pool to submit work to a pool of processes.

The map function will take an iterable and split it into chunks of work, on which your function can be applied (see here):

This method chops the iterable into a number of chunks which it submits to the process pool as separate tasks. The (approximate) size of these chunks can be specified by setting chunksize to a positive integer.

For your example, you could pass a list of file names to the map function, along with a function that opens each file and manipulates it. You can return the processed file contents as the result and concatenate everything in the main process.
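A minimal sketch of this approach (the process_file helper below is a hypothetical stand-in for the question's function_to_process_filename, the per-file computation is assumed to be summing one number per line, and the chunksize of 100 is just an example value):

import os
import numpy as np
from multiprocessing import Pool

def process_file(filename):
    # open one text file and compute its value (here: sum one number per line)
    with open(filename) as f:
        return sum(float(line) for line in f)

if __name__ == '__main__':
    filepath = '/path/to/file'
    filenames = [os.path.join(filepath, name) for name in os.listdir(filepath)]

    with Pool() as pool:
        # map splits the filename list into chunks, hands them to the worker
        # processes, and returns the per-file results in the original order
        results = pool.map(process_file, filenames, chunksize=100)

    # collect everything into the master array in the main process
    large_array = np.array(results)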

So if I understand correctly, what you are looking for is a way to process several jobs in parallel with multiprocessing, with each job filling a single Python data structure containing the results? I will complete the good previous answer by indeed using map, but also multiprocessing.Manager():

import os
from functools import partial
from multiprocessing import Pool, Manager, cpu_count

# path to dir
dir_path = '/path/to/dir'

# Get the dir content
files = os.listdir(dir_path)

def processing_func(results_array, indexed_filename):
    index, filename = indexed_filename
    # process the file and store its result at the matching index
    results_array[index] = function_to_process_filename(os.path.join(dir_path, filename))

NB_CPU = cpu_count()
# change the typecode ('d' = double) to match what the array will contain
results_array = Manager().Array('d', [0.0] * len(files))
with Pool(processes=NB_CPU) as pool:
    # partial is used to pass the shared array to the multiprocessed function
    function_with_args = partial(processing_func, results_array)
    # this will iterate through the files, NB_CPU processes at a time,
    # applying function_with_args to each (index, filename) pair
    pool.map(function_with_args, enumerate(files))
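Note that processing_func must be defined at module level so it can be pickled, and on platforms that spawn new processes (e.g. Windows) the Pool creation should be wrapped in an if __name__ == '__main__': guard. Once the pool has finished, the manager array can be converted back to a regular structure in the main process, e.g. with np.array(results_array).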
