Python: Process file using multiple cores
I am currently trying to read a large file (80 million lines), and for each line I need to perform a computationally intensive matrix multiplication. After computing the result, I want to insert it into a database. Because this process is so time-intensive, I want to split the file across multiple cores to speed it up.
After some research I found this promising approach, which splits a file into n blocks:
def file_block(fp, number_of_blocks, block):
    '''
    A generator that splits a file into blocks and iterates
    over the lines of one of the blocks.
    '''
    assert 0 < number_of_blocks
    assert 0 <= block < number_of_blocks
    fp.seek(0, 2)
    file_size = fp.tell()
    ini = file_size * block // number_of_blocks
    end = file_size * (1 + block) // number_of_blocks
    if ini <= 0:
        fp.seek(0)
    else:
        fp.seek(ini - 1)
        fp.readline()  # skip the partial line the seek landed in
    while fp.tell() < end:
        yield fp.readline()
You can iterate over the blocks like this:
if __name__ == '__main__':
    fp = open(filename)
    number_of_chunks = 4
    for chunk_number in range(number_of_chunks):
        print(chunk_number, 100 * '=')
        for line in file_block(fp, number_of_chunks, chunk_number):
            process(line)
While this works, I run into problems when parallelizing it with multiprocessing:
from multiprocessing import Pool, cpu_count

fp = open(filename)
number_of_chunks = 4
li = [file_block(fp, number_of_chunks, chunk_number)
      for chunk_number in range(number_of_chunks)]
p = Pool(cpu_count() - 1)
p.map(processChunk, li)
The error says that generators cannot be pickled. While I understand this error, iterating over the whole file first just to put all lines into a list would be too expensive.
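The pickling failure is easy to reproduce in isolation, independent of the file-splitting code:

```python
import pickle

# Any generator object, including the ones returned by file_block,
# fails the same way when multiprocessing tries to transport it.
gen = (x * x for x in range(3))
try:
    pickle.dumps(gen)
except TypeError as exc:
    print(exc)  # cannot pickle a generator object
```

A generator's state lives in a suspended stack frame, which the pickle protocol has no way to serialize, so only picklable descriptions of the work (such as plain tuples) can cross the process boundary.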
Moreover, I want each core to work on a block of lines per iteration, because inserting multiple lines into the database at once is more efficient than inserting them one by one, as the typical map approach would.
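The batched-insert idea can be sketched independently of the parallelism. This is a minimal illustration using sqlite3 and a hypothetical `results` table (the real database, table, and batch size are assumptions here); `executemany` sends each batch in one call:

```python
import sqlite3
from itertools import islice

def batched(iterable, size):
    """Yield lists of up to `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

def insert_lines(conn, lines, batch_size=500):
    """Insert processed lines in batches instead of one at a time."""
    conn.execute("CREATE TABLE IF NOT EXISTS results (value TEXT)")
    for batch in batched(lines, batch_size):
        conn.executemany("INSERT INTO results (value) VALUES (?)",
                         [(line,) for line in batch])
    conn.commit()
```

The same pattern applies to most database drivers, since the DB-API 2.0 `executemany` call is widely supported.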
Thanks for your help.
Instead of creating the generators up front and passing them into each worker, leave that to the worker code:
def processChunk(params):
    filename, chunk_number, number_of_chunks = params
    with open(filename, 'r') as fp:
        for line in file_block(fp, number_of_chunks, chunk_number):
            process(line)

li = [(filename, i, number_of_chunks) for i in range(number_of_chunks)]
p.map(processChunk, li)