
python multiprocessing with shared output

I am trying to process a large number of text files and calculate data within them (simple addition). The problem is that this takes a long time, and I know that there are multiprocessing facilities in other languages, but I have never done anything like this in Python.

Let's say I have a directory with 16,000 files. Currently, I open each file individually, bring it into an array in Python, do some manipulation of the data, and then output to a master array (with length 16,000). Can multiprocessing be used to run several instances of 'open the file, process the data, and output the result' into the same array?

The original code is basically like this:

import os
import numpy as np

# path to the directory of text files
filepath = '/path/to/file'

# Get the dir contents
filedir = os.listdir(filepath)

# Pre-allocate large array
large_array = np.zeros(len(filedir))

# Begin loop
for i in range(len(filedir)):
    # Define the path to load the text file
    filename = filepath + '/' + filedir[i]

    # process one file and store its result
    output = function_to_process_filename(filename)

    large_array[i] = output

Where would the multiprocessing / parallel portion go to potentially make the code run faster, and what would it look like in Python?

You can use a multiprocessing Pool to submit work to a pool of processes.

The map function will take an iterable and split it into chunks of work, on which your function can be applied (see here):

This method chops the iterable into a number of chunks which it submits to the process pool as separate tasks. The (approximate) size of these chunks can be specified by setting chunksize to a positive integer.

For your example, you could pass a list of file names to the map function, along with a function that opens each file and manipulates it. You can return the processed file contents as the result and concatenate everything in the main process.
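A minimal sketch of this approach (the process_file helper below is a hypothetical stand-in for the question's function_to_process_filename, the per-file computation is assumed to be summing one number per line, and the chunksize of 100 is just an example value):

import os
import numpy as np
from multiprocessing import Pool

def process_file(filename):
    # open one text file and compute its value (here: sum one number per line)
    with open(filename) as f:
        return sum(float(line) for line in f)

if __name__ == '__main__':
    filepath = '/path/to/file'
    filenames = [os.path.join(filepath, name) for name in os.listdir(filepath)]

    with Pool() as pool:
        # map splits the filename list into chunks, hands them to the worker
        # processes, and returns the per-file results in the original order
        results = pool.map(process_file, filenames, chunksize=100)

    # collect everything into the master array in the main process
    large_array = np.array(results)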

So if I understand correctly, what you are looking for is a way to process several jobs in parallel with multiprocessing, with each job filling a single Python data structure containing the results? I will complete the good previous answer by indeed using map, but also multiprocessing.Manager():

import os
from functools import partial
from multiprocessing import Pool, Manager, cpu_count

# path to dir
dir_path = '/path/to/dir'

# Get the dir content
files = os.listdir(dir_path)

def processing_func(results_array, indexed_filename):
    index, filename = indexed_filename
    # process the file and store its result at the matching index
    results_array[index] = function_to_process_filename(os.path.join(dir_path, filename))

NB_CPU = cpu_count()
# change the typecode ('d' = double) to match what the array will contain
results_array = Manager().Array('d', [0.0] * len(files))
with Pool(processes=NB_CPU) as pool:
    # partial is used to pass the shared array to the multiprocessed function
    function_with_args = partial(processing_func, results_array)
    # this will iterate through the files, NB_CPU processes at a time,
    # applying function_with_args to each (index, filename) pair
    pool.map(function_with_args, enumerate(files))
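Note that processing_func must be defined at module level so it can be pickled, and on platforms that spawn new processes (e.g. Windows) the Pool creation should be wrapped in an if __name__ == '__main__': guard. Once the pool has finished, the manager array can be converted back to a regular structure in the main process, e.g. with np.array(results_array).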
