
Working with multiprocessing using map() in Python code in bigdata

I'm trying to get some values (which I get using an extract function) from URLs stored in data.file; there are about 3,000,000 URL links in the file. Here is my code snippet:

from multiprocessing import Pool
p = Pool(10)
revenuelist = p.map(extract, data.file)

But the problem is that, due to internet connection issues, this code has to run again from the start whenever there is a connection problem. How do I add fault tolerance to my code (i.e. store intermediate results, to avoid repeating work that has already been done)?

A very simple solution is using a file to store your current status. Use try...finally to handle failures:

import os
from multiprocessing import Pool

# Read how far we got on a previous run (0 if this is the first run).
current = 0
if os.path.exists(FILENAME):
    with open(FILENAME) as f:
        current = int(f.read() or 0)

if current:
    skip_lines(current)  # skip the URLs that were already processed

try:
    with Pool() as pool:
        # imap() yields results lazily and in input order, so `current`
        # counts how many input lines have been fully processed.
        results = pool.imap(extract, data.file)
        for result in results:
            do_something(result)
            current += 1
finally:
    # Persist progress even if the loop dies part-way through,
    # e.g. because of a connection error.
    with open(FILENAME, "w") as f:
        f.write(str(current))

See also: `concurrent.futures` (much cooler than multiprocessing.Pool).
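For reference, here is a minimal sketch of the same checkpointing idea with `concurrent.futures`. The `extract` and `do_something` functions mirror the names used above, and `urls` (a list of URL strings) and the `progress.txt` filename are assumptions for illustration, not part of the original code:

import concurrent.futures

def process_all(urls, start=0):
    done = start
    try:
        # ProcessPoolExecutor.map() behaves much like Pool.imap(): results
        # come back in input order, so a plain counter records how far we
        # got before any failure.
        with concurrent.futures.ProcessPoolExecutor(max_workers=10) as pool:
            for result in pool.map(extract, urls[start:]):
                do_something(result)
                done += 1
    finally:
        # Save the checkpoint whether we finished or crashed.
        with open("progress.txt", "w") as f:
            f.write(str(done))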

A better solution would be using a database to completely track your progress, and/or using a better task queue (for example, celery) to execute your jobs.
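As a rough illustration of the database idea, one could keep a small SQLite table of processed URLs and skip anything already marked done. This is only a sketch: it assumes data.file is a plain text file with one URL per line, the progress.db database and its `done` table are made up for the example, and `extract` and `do_something` again stand in for the code above:

import sqlite3
from multiprocessing import Pool

conn = sqlite3.connect("progress.db")
conn.execute("CREATE TABLE IF NOT EXISTS done (url TEXT PRIMARY KEY)")

def already_done(url):
    return conn.execute("SELECT 1 FROM done WHERE url = ?", (url,)).fetchone() is not None

def mark_done(url):
    conn.execute("INSERT OR IGNORE INTO done (url) VALUES (?)", (url,))
    conn.commit()

# Only feed the pool URLs that have not been processed on earlier runs.
with open("data.file") as f:
    pending = [line.strip() for line in f if not already_done(line.strip())]

with Pool(10) as pool:
    # imap() preserves input order, so zip() pairs each URL with its result.
    for url, result in zip(pending, pool.imap(extract, pending)):
        do_something(result)
        mark_done(url)  # recorded per URL, so a crash only repeats the in-flight ones

Because progress is recorded per URL rather than as a single counter, restarting after a failure repeats at most the handful of URLs that were in flight when the crash happened.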
