Python: pickle error with multiprocessing
As suggested below, I have changed my code to use Pool instead. I've also simplified my functions and included all my code below. However, now I'm getting a different error:

NameError: global name 'split_files' is not defined

What I want to do is pass the actual file chunk into the parse_csv_chunk function, but I'm not sure how to do that.
import csv
from itertools import islice
from collections import deque
import time
import math
import multiprocessing as mp
import os
import sys
import tempfile

csv_filename = 'test.csv'

def parse_csv_chunk(files_index):
    global split_files
    print files_index
    print len(split_files)
    return 1

def split(infilename, num_chunks):
    READ_BUFFER = 2**13
    in_file_size = os.path.getsize(infilename)
    print 'Original file size:', in_file_size
    chunk_size = in_file_size // num_chunks
    print 'Target chunk size:', chunk_size
    print 'Target number of chunks:', num_chunks
    files = []
    with open(infilename, 'rb', READ_BUFFER) as infile:
        infile.next()
        infile.next()
        infile.next()
        for _ in xrange(num_chunks):
            temp_file = tempfile.TemporaryFile()
            while temp_file.tell() < chunk_size:
                try:
                    #write 3 lines before checking if still < chunk_size
                    #this is done to improve performance
                    #the result is that each chunk will not be exactly the same size
                    temp_file.write(infile.next())
                    temp_file.write(infile.next())
                    temp_file.write(infile.next())
                #end of original file
                except StopIteration:
                    break
            #rewind each chunk
            temp_file.seek(0)
            files.append(temp_file)
    return files

if __name__ == '__main__':
    start = time.time()
    num_chunks = mp.cpu_count()
    split_files = split(csv_filename, num_chunks)
    print 'Number of files after splitting: ', len(split_files)
    pool = mp.Pool(processes = num_chunks)
    results = [pool.apply_async(parse_csv_chunk, args=(x,)) for x in range(num_chunks)]
    output = [p.get() for p in results]
    print output
I'm trying to split up a csv file into parts and have them processed by each of my CPU's cores. This is what I have so far:
import csv
from itertools import islice
from collections import deque
import time
import math
import multiprocessing as mp
import os
import sys
import tempfile

def parse_csv_chunk(infile):
    #code here
    return

def split(infilename, num_chunks):
    #code here
    return files

def get_header_indices(infilename):
    #code here
    return

if __name__ == '__main__':
    start = time.time() #start measuring performance
    num_chunks = mp.cpu_count() #record number of CPU cores
    files = split(csv_filename, num_chunks) #split csv file into as many chunks as there are CPU cores, stored as a list
    print 'number of files after splitting: ', len(files)
    get_header_indices(csv_filename) #get headers of csv file
    print headers_list
    processes = [mp.Process(target=parse_csv_chunk,
                            args=ifile) for ifile in enumerate(files)] #create a process for each file chunk
    for p in processes:
        p.start()
    for p in processes:
        p.join()
    end = time.time()
    print "Execution time: %.2f" % (end - start) #display performance
There seems to be a problem at the line p.start(). I see a lot of output on the console, which eventually indicates an error:

pickle.PicklingError: Can't pickle <built-in method write of file object at 0x0222EAC8>: it's not found as __main__.write

I did not include the code for the functions I called as they are quite long, but I can if needed. I'm wondering if I'm using multiprocessing correctly.
First off, is there a reason you are not using a Pool and the imap method of the Pool?
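In Python 3 terms, a minimal Pool.imap sketch might look like the following (parse_chunk and the sample chunks are placeholders for illustration, not the asker's functions):

```python
import multiprocessing as mp

def parse_chunk(lines):
    # Stand-in payload: count the rows in one chunk of CSV lines.
    return len(lines)

if __name__ == '__main__':
    # Hypothetical data standing in for the split CSV: three chunks of rows.
    chunks = [['a,1', 'b,2'], ['c,3'], ['d,4', 'e,5', 'f,6']]
    with mp.Pool(processes=2) as pool:
        # imap yields results lazily, in the order the chunks were submitted.
        counts = list(pool.imap(parse_chunk, chunks))
    print(counts)  # [2, 1, 3]
```

Because imap only has to pickle the items of the iterable (here, plain lists of strings), it sidesteps the temptation to ship file objects to the workers.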
Second, it's very hard to tell any specifics without seeing your code, especially since the error points to parts of the code that are not provided.

However, it looks like you are using multiprocessing correctly from what you have provided -- and it's a serialization problem.
Note that if you use dill, you can serialize the write method.
>>> import dill
>>>
>>> f = open('foo.bar', 'w')
>>> dill.dumps(f.write)
'\x80\x02cdill.dill\n_get_attr\nq\x00cdill.dill\n_create_filehandle\nq\x01(U\x07foo.barq\x02U\x01wq\x03K\x00\x89c__builtin__\nopen\nq\x04\x89K\x00U\x00q\x05tq\x06Rq\x07U\x05writeq\x08\x86q\tRq\n.'
Most versions of multiprocessing use cPickle (or a version of pickle that is built in C), and while dill can inject its types into the pure-Python version of pickle, it can't do so in the C equivalent.
There is a dill-activated fork of multiprocessing -- so you might try that: if it's purely a pickling problem, then you should get past it with multiprocess.

See: https://github.com/uqfoundation/multiprocess
EDIT (after OP update): The global declaration in your helper function isn't going to play well with pickle. Why not just use a payload function (like split) that reads a portion of the file and returns the contents, or writes to the target file? Don't return a list of files. I know they are TemporaryFile objects, but unless you use dill (and even then it's touchy) you can't pickle a file. If you absolutely have to, return the file name, not the file, and don't use a TemporaryFile; pickle will choke trying to pass the file. So, you should refactor your code, or, as I suggested earlier, try to see if you can bypass the serialization issues by using multiprocess (which uses dill).