
Python: pickle error with multiprocessing

As suggested below, I have changed my code to use Pool instead. I've also simplified my functions and included all my code below. However, now I'm getting a different error: NameError: global name 'split_files' is not defined

What I want to do is pass the actual file chunk into the parse_csv_chunk function, but I'm not sure how to do that.

import csv
from itertools import islice
from collections import deque
import time
import math
import multiprocessing as mp
import os
import sys
import tempfile

csv_filename = 'test.csv'

def parse_csv_chunk(files_index):
    global split_files
    print files_index
    print len(split_files)
    return 1

def split(infilename, num_chunks):
    READ_BUFFER = 2**13
    in_file_size = os.path.getsize(infilename)
    print 'Original file size:', in_file_size
    chunk_size = in_file_size // num_chunks
    print 'Target chunk size:', chunk_size
    print 'Target number of chunks:', num_chunks
    files = []
    with open(infilename, 'rb', READ_BUFFER) as infile:
        infile.next()
        infile.next()
        infile.next()
        for _ in xrange(num_chunks):
            temp_file = tempfile.TemporaryFile()
            while temp_file.tell() < chunk_size:
                try:
                    #write 3 lines before checking if still < chunk_size
                    #this is done to improve performance
                    #the result is that each chunk will not be exactly the same size
                    temp_file.write(infile.next())
                    temp_file.write(infile.next())
                    temp_file.write(infile.next())
                #end of original file
                except StopIteration:
                    break
            #rewind each chunk
            temp_file.seek(0)
            files.append(temp_file)
    return files

if __name__ == '__main__':
    start = time.time()
    num_chunks = mp.cpu_count()
    split_files = split(csv_filename, num_chunks)
    print 'Number of files after splitting: ', len(split_files)

    pool = mp.Pool(processes = num_chunks)
    results = [pool.apply_async(parse_csv_chunk, args=(x,)) for x in range(num_chunks)]
    output = [p.get() for p in results]
    print output

I'm trying to split up a csv file into parts and have them processed by each of my CPU cores. This is what I have so far:

import csv
from itertools import islice
from collections import deque
import time
import math
import multiprocessing as mp
import os
import sys
import tempfile

def parse_csv_chunk(infile):
    #code here
    return

def split(infilename, num_chunks):
    #code here
    return files

def get_header_indices(infilename):
    #code here
    return

if __name__ == '__main__':
    start = time.time() #start measuring performance
    num_chunks = mp.cpu_count() #record number of CPU cores
    files = split(csv_filename, num_chunks) #split csv file into a number equal of CPU cores and store as list
    print 'number of files after splitting: ', len(files)
    get_header_indices(csv_filename) #get headers of csv file
    print headers_list

    processes = [mp.Process(target=parse_csv_chunk, 
       args=ifile) for ifile in enumerate(files)] #create a list of processes for each file chunk

    for p in processes:
        p.start()

    for p in processes:
        p.join()

    end = time.time()

    print "Execution time: %.2f" % (end - start) #display performance

There seems to be a problem at the line p.start(). I see a lot of output on the console, which eventually indicates an error:

pickle.PicklingError: Can't pickle <built-in method write of file object at 0x0222EAC8>: it's not found as __main__.write

I did not include the code for the functions I called as they are quite long, but I can if needed. I'm wondering if I'm using multiprocessing correctly.
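
For reference, the failure can be reproduced without multiprocessing at all, since file objects and their bound methods are not picklable in Python 2. A minimal check (the exact exception type may differ between pickle and cPickle):

import pickle

f = open('test.csv', 'rb')
try:
    pickle.dumps(f)          # file objects are not picklable
except Exception as e:
    print 'cannot pickle the file object:', e
try:
    pickle.dumps(f.write)    # neither are their bound methods
except Exception as e:
    print 'cannot pickle the bound method:', e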

First off, is there a reason you are not using a Pool and the imap method of the Pool?
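
For example, a minimal sketch of what that looks like; parse_chunk and the chunk_names list here are hypothetical stand-ins for your own worker and inputs:

import multiprocessing as mp

def parse_chunk(filename):
    # hypothetical worker: processes one chunk per call
    with open(filename, 'rb') as f:
        return sum(1 for _ in f)    # e.g. count the lines in this chunk

if __name__ == '__main__':
    chunk_names = ['chunk0.csv', 'chunk1.csv']   # assumed to exist on disk
    pool = mp.Pool(processes=mp.cpu_count())
    for result in pool.imap(parse_chunk, chunk_names):
        print result    # imap yields results in input order
    pool.close()
    pool.join()

Note that only strings (the file names) cross the process boundary here, so nothing unpicklable is ever handed to the pool.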

Second, it's very hard to tell any specifics without seeing your code, especially since the error points to parts of the code that are not provided.

However, it looks like you are using multiprocessing correctly from what you have provided -- and it's a serialization problem.

Note that if you use dill, you can serialize the write method.

>>> import dill
>>> 
>>> f = open('foo.bar', 'w') 
>>> dill.dumps(f.write)
'\x80\x02cdill.dill\n_get_attr\nq\x00cdill.dill\n_create_filehandle\nq\x01(U\x07foo.barq\x02U\x01wq\x03K\x00\x89c__builtin__\nopen\nq\x04\x89K\x00U\x00q\x05tq\x06Rq\x07U\x05writeq\x08\x86q\tRq\n.'

Most versions of multiprocessing use cPickle (a version of pickle built in C), and while dill can inject its types into the pure-Python version of pickle, it can't do so in the C equivalent.

There is a dill-activated fork of multiprocessing, so you might try that: if it's purely a pickling problem, then you should get past it with multiprocess.

See: https://github.com/uqfoundation/multiprocess
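
Since multiprocess is a drop-in replacement, the change can be as small as the import line. A quick sketch, assuming the package is installed (e.g. via pip install multiprocess):

import multiprocess as mp    # dill-powered fork of multiprocessing

if __name__ == '__main__':
    pool = mp.Pool(2)
    # lambdas are not picklable with the stdlib pickle, but dill handles
    # them, so this works under multiprocess where multiprocessing fails
    print pool.map(lambda x: x ** 2, range(10))
    pool.close()
    pool.join()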

EDIT (after OP update): The global declaration in your helper function isn't going to play well with pickle. Why not just use a payload function (like split) that reads a portion of the file and returns the contents, or writes to the target file? Don't return a list of files. I know they are TemporaryFile objects, but unless you use dill (and even then it's touchy) you can't pickle a file. If you absolutely have to, return the file name, not the file, and don't use a TemporaryFile; pickle will choke trying to pass the file itself. So, you should refactor your code, or, as I suggested earlier, see if you can bypass the serialization issues by using multiprocess (which uses dill).
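
Putting that together, one way to restructure the code from the question: have split write named temporary files and return their paths, and have the worker open its chunk by name, so only strings are ever pickled. A simplified sketch (it skips the header-skipping and write-three-lines batching of the original, the line-count payload is a placeholder, and delete=False plus the explicit unlink is an assumption needed so the paths can be reopened by the workers):

import os
import tempfile
import multiprocessing as mp

def split(infilename, num_chunks):
    # same chunking idea as before, but close each chunk and
    # return its *name* so that only strings get pickled
    chunk_size = os.path.getsize(infilename) // num_chunks
    names = []
    with open(infilename, 'rb') as infile:
        for _ in xrange(num_chunks):
            temp = tempfile.NamedTemporaryFile(delete=False)
            while temp.tell() < chunk_size:
                line = infile.readline()
                if not line:    # end of original file
                    break
                temp.write(line)
            temp.close()
            names.append(temp.name)
    return names

def parse_csv_chunk(chunk_name):
    # the payload function receives a path, not a file object
    with open(chunk_name, 'rb') as f:
        return sum(1 for _ in f)    # placeholder: count lines in this chunk

if __name__ == '__main__':
    chunk_names = split('test.csv', mp.cpu_count())
    pool = mp.Pool()
    print pool.map(parse_csv_chunk, chunk_names)
    pool.close()
    pool.join()
    for name in chunk_names:
        os.unlink(name)    # clean up, since delete=False was used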
