Progress measuring with python's multiprocessing Pool and map function
I'm using the following code for parallel csv processing:
#!/usr/bin/env python
import csv
from time import sleep
from multiprocessing import Pool
from multiprocessing import cpu_count
from multiprocessing import current_process
from pprint import pprint as pp

def init_worker(x):
    sleep(.5)
    print "(%s,%s)" % (x[0], x[1])
    x.append(int(x[0])**2)
    return x

def parallel_csv_processing(inputFile, outputFile, header=["Default", "header", "please", "change"], separator=",", skipRows = 0, cpuCount = 1):
    # OPEN FH FOR READING INPUT FILE
    inputFH = open(inputFile, "rt")
    csvReader = csv.reader(inputFH, delimiter=separator)

    # SKIP HEADERS
    for skip in xrange(skipRows):
        csvReader.next()

    # PARALLELIZE COMPUTING INTENSIVE OPERATIONS - CALL FUNCTION HERE
    try:
        p = Pool(processes = cpuCount)
        results = p.map(init_worker, csvReader, chunksize = 10)
        p.close()
        p.join()
    except KeyboardInterrupt:
        p.close()
        p.join()
        p.terminate()

    # CLOSE FH FOR READING INPUT
    inputFH.close()

    # OPEN FH FOR WRITING OUTPUT FILE
    outputFH = open(outputFile, "wt")
    csvWriter = csv.writer(outputFH, lineterminator='\n')

    # WRITE HEADER TO OUTPUT FILE
    csvWriter.writerow(header)

    # WRITE RESULTS TO OUTPUT FILE
    [csvWriter.writerow(row) for row in results]

    # CLOSE FH FOR WRITING OUTPUT
    outputFH.close()
    print pp(results)
    # print len(results)

def main():
    inputFile = "input.csv"
    outputFile = "output.csv"
    parallel_csv_processing(inputFile, outputFile, cpuCount = cpu_count())

if __name__ == '__main__':
    main()
I would like to somehow measure the progress of the script (just plain text, not any fancy ASCII art). The one option that comes to my mind is to compare the lines that were successfully processed by init_worker against all the lines in input.csv, and to print the actual state, e.g. every second. Can you please point me to the right solution? I've found several articles with a similar problem, but I was not able to adapt them to my needs because none of them used the Pool class and the map method.

I would also like to ask about the p.close(), p.join() and p.terminate() methods. I've seen them mainly with the Process class, not Pool — are they necessary with Pool, and have I used them correctly? Using p.terminate() was intended to kill the process with ctrl+c, but that is a different story which does not have a happy ending yet. Thank you.
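For reference, the intended contract of those three methods can be sketched as follows (with a toy square worker standing in for the csv code, and run as an illustrative helper name): close() followed by join() is the clean shutdown path, while terminate() followed by join() is the abort path. Calling close() and join() before terminate(), as the except branch above does, makes the terminate() call pointless because the pool has already been joined.

```python
from multiprocessing import Pool

def square(x):
    # toy stand-in for a real worker function
    return x * x

def run(n):
    p = Pool(processes=2)
    try:
        results = p.map(square, range(n))
        p.close()   # no more tasks may be submitted
        p.join()    # wait for the workers to exit cleanly
        return results
    except KeyboardInterrupt:
        p.terminate()  # kill the workers immediately, discarding pending work
        p.join()       # join() should still follow terminate()
        raise
```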
PS: My input.csv looks like this, if it matters:
0,0
1,3
2,6
3,9
...
...
48,144
49,147
PPS: as I said, I'm a newbie with multiprocessing and the code I've put together just happens to work. The one drawback I can see is that the whole csv is stored in memory, so if you guys have a better idea, do not hesitate to share it.
Edit

In reply to @JFSebastian, here is my actual code based on your suggestions:
#!/usr/bin/env python
import csv
from time import sleep
from multiprocessing import Pool
from multiprocessing import cpu_count
from multiprocessing import current_process
from pprint import pprint as pp
from tqdm import tqdm

def do_job(x):
    sleep(.5)
    # print "(%s,%s)" % (x[0],x[1])
    x.append(int(x[0])**2)
    return x

def parallel_csv_processing(inputFile, outputFile, header=["Default", "header", "please", "change"], separator=",", skipRows = 0, cpuCount = 1):
    # OPEN FH FOR READING INPUT FILE
    inputFH = open(inputFile, "rb")
    csvReader = csv.reader(inputFH, delimiter=separator)

    # SKIP HEADERS
    for skip in xrange(skipRows):
        csvReader.next()

    # OPEN FH FOR WRITING OUTPUT FILE
    outputFH = open(outputFile, "wt")
    csvWriter = csv.writer(outputFH, lineterminator='\n')

    # WRITE HEADER TO OUTPUT FILE
    csvWriter.writerow(header)

    # PARALLELIZE COMPUTING INTENSIVE OPERATIONS - CALL FUNCTION HERE
    try:
        p = Pool(processes = cpuCount)
        # results = p.map(do_job, csvReader, chunksize = 10)
        for result in tqdm(p.imap_unordered(do_job, csvReader, chunksize=10)):
            csvWriter.writerow(result)
        p.close()
        p.join()
    except KeyboardInterrupt:
        p.close()
        p.join()

    # CLOSE FH FOR READING INPUT
    inputFH.close()
    # CLOSE FH FOR WRITING OUTPUT
    outputFH.close()
    print pp(result)
    # print len(result)

def main():
    inputFile = "input.csv"
    outputFile = "output.csv"
    parallel_csv_processing(inputFile, outputFile, cpuCount = cpu_count())

if __name__ == '__main__':
    main()
Here is the output of tqdm:
1 [elapsed: 00:05, 0.20 iters/sec]
What does this output mean? On the page you referred to, tqdm is used in a loop in the following way:
>>> import time
>>> from tqdm import tqdm
>>> for i in tqdm(range(100)):
... time.sleep(1)
...
|###-------| 35/100 35% [elapsed: 00:35 left: 01:05, 1.00 iters/sec]
This output makes sense, but what does my output mean? Also, it does not seem that the ctrl+c problem is fixed: after hitting ctrl+c the script throws a traceback, and if I hit ctrl+c again I get a new traceback, and so on. The only way to kill it is to send it to the background (ctrl+z) and then kill it (kill %1).
To show the progress, replace pool.map with pool.imap_unordered:
from tqdm import tqdm  # $ pip install tqdm

for result in tqdm(pool.imap_unordered(init_worker, csvReader, chunksize=10)):
    csvWriter.writerow(result)
The tqdm part is optional; see Text Progress Bar in the Console.
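One detail worth spelling out, since it explains the bare 1 [elapsed: 00:05, 0.20 iters/sec] line in the question's edit: when tqdm is handed an iterator such as pool.imap_unordered(...), it has no way of knowing how many items are coming, so it can only show a counter instead of a percentage bar. A sketch of the workaround is to do a cheap pre-pass over the file to count the rows and pass that as total= (count_rows and process_with_progress are illustrative names of my own, and the import falls back to a no-op so the sketch runs even without tqdm installed):

```python
import csv
from multiprocessing import Pool

try:
    from tqdm import tqdm  # $ pip install tqdm
except ImportError:
    def tqdm(iterable, **kwargs):
        # no-op fallback: iterate without a progress bar
        return iterable

def do_job(row):
    # same toy computation as in the question: append the square of column 0
    row.append(int(row[0]) ** 2)
    return row

def count_rows(path, skip_rows=0):
    # cheap pre-pass so tqdm knows the total and can render a percentage bar
    with open(path) as f:
        return max(sum(1 for _ in f) - skip_rows, 0)

def process_with_progress(input_file, output_file, skip_rows=0):
    total = count_rows(input_file, skip_rows)
    with open(input_file) as in_fh, open(output_file, "w") as out_fh:
        reader = csv.reader(in_fh)
        for _ in range(skip_rows):
            next(reader)
        writer = csv.writer(out_fh, lineterminator="\n")
        pool = Pool(processes=2)
        try:
            # total= is what turns the bare counter into a "35/100 35%" style bar
            for result in tqdm(pool.imap_unordered(do_job, reader, chunksize=10),
                               total=total):
                writer.writerow(result)
            pool.close()
            pool.join()
        except KeyboardInterrupt:
            pool.terminate()
            pool.join()
            raise
```

Calling it as process_with_progress("input.csv", "output.csv") reproduces the question's pipeline, but with a progress bar that knows how far along it is.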
Incidentally, it fixes your "whole csv is stored in memory" and "KeyboardInterrupt is not raised" problems.
Here's a complete code example:
#!/usr/bin/env python
import itertools
import logging
import multiprocessing
import time

def compute(i):
    time.sleep(.5)
    return i**2

if __name__ == "__main__":
    logging.basicConfig(format="%(asctime)-15s %(levelname)s %(message)s",
                        datefmt="%F %T", level=logging.DEBUG)
    pool = multiprocessing.Pool()
    try:
        for square in pool.imap_unordered(compute, itertools.count(), chunksize=10):
            logging.debug(square)  # report progress by printing the result
    except KeyboardInterrupt:
        logging.warning("got Ctrl+C")
    finally:
        pool.terminate()
        pool.join()
You should see the output in batches every .5 * chunksize seconds. If you press Ctrl+C, you should see KeyboardInterrupt raised in the child processes and in the main process. In Python 3, the main process exits immediately. In Python 2, the KeyboardInterrupt is delayed until the next batch should have been printed (a bug in Python).
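If the repeated tracebacks on Ctrl+C still bother you, a common variation (not part of the answer above, just a well-known pattern) is to make the worker processes ignore SIGINT via the pool's initializer, so that only the parent process receives KeyboardInterrupt and decides what to do. A minimal sketch (run is an illustrative helper name):

```python
import signal
from multiprocessing import Pool

def ignore_sigint():
    # runs once in each worker: Ctrl+C is then delivered only to the parent,
    # so you do not get one traceback per child process
    signal.signal(signal.SIGINT, signal.SIG_IGN)

def compute(i):
    return i ** 2

def run(n):
    pool = Pool(processes=2, initializer=ignore_sigint)
    try:
        results = pool.map(compute, range(n))
        pool.close()
    except KeyboardInterrupt:
        pool.terminate()  # abort: workers did not see the signal themselves
        raise
    finally:
        pool.join()       # join() is valid after either close() or terminate()
    return results
```

The trade-off is that the workers can no longer react to Ctrl+C on their own, so the parent must terminate() them explicitly, as the except branch does.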