
Working with multiprocessing using map() in Python code in bigdata

I'm trying to get some values (which I get using an extract function) from URLs stored in data.file; there are about 3,000,000 URL links in the file. Here is my code snippet:

from multiprocessing import Pool
p = Pool(10)
revenuelist = p.map(extract, data.file)

But the problem is that, due to internet connection issues, this code has to run again from the start whenever there is a connection problem. How do I add fault tolerance to my code (i.e. store intermediate results, to avoid repeating work that has already been done)?

A very simple solution is using a file to store your current status. Use try...finally to handle failures:

import os
from multiprocessing import Pool

# Read how far we got on a previous run (0 if this is the first run).
current = 0
if os.path.exists(FILENAME):
    with open(FILENAME) as f:
        current = int(f.read() or 0)

if current:
    skip_lines(current)  # skip the URLs that were already processed

try:
    with Pool() as pool:
        # imap() yields results lazily and in input order, so `current`
        # counts how many input lines have been fully processed.
        results = pool.imap(extract, data.file)
        for result in results:
            do_something(result)
            current += 1
finally:
    # Persist progress even if the loop dies part-way through,
    # e.g. because of a connection error.
    with open(FILENAME, "w") as f:
        f.write(str(current))

See also: `concurrent.futures` (much cooler than multiprocessing.Pool).
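For reference, here is a minimal sketch of the same checkpointing idea with `concurrent.futures`. The `extract` and `do_something` functions mirror the names used above, and `urls` (a list of URL strings) and the `progress.txt` filename are assumptions for illustration, not part of the original code:

import concurrent.futures

def process_all(urls, start=0):
    done = start
    try:
        # ProcessPoolExecutor.map() behaves much like Pool.imap(): results
        # come back in input order, so a plain counter records how far we
        # got before any failure.
        with concurrent.futures.ProcessPoolExecutor(max_workers=10) as pool:
            for result in pool.map(extract, urls[start:]):
                do_something(result)
                done += 1
    finally:
        # Save the checkpoint whether we finished or crashed.
        with open("progress.txt", "w") as f:
            f.write(str(done))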

A better solution would be using a database to completely track your progress, and/or using a better task queue (for example, celery) to execute your jobs.
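As a rough illustration of the database idea, one could keep a small SQLite table of processed URLs and skip anything already marked done. This is only a sketch: it assumes data.file is a plain text file with one URL per line, the progress.db database and its `done` table are made up for the example, and `extract` and `do_something` again stand in for the code above:

import sqlite3
from multiprocessing import Pool

conn = sqlite3.connect("progress.db")
conn.execute("CREATE TABLE IF NOT EXISTS done (url TEXT PRIMARY KEY)")

def already_done(url):
    return conn.execute("SELECT 1 FROM done WHERE url = ?", (url,)).fetchone() is not None

def mark_done(url):
    conn.execute("INSERT OR IGNORE INTO done (url) VALUES (?)", (url,))
    conn.commit()

# Only feed the pool URLs that have not been processed on earlier runs.
with open("data.file") as f:
    pending = [line.strip() for line in f if not already_done(line.strip())]

with Pool(10) as pool:
    # imap() preserves input order, so zip() pairs each URL with its result.
    for url, result in zip(pending, pool.imap(extract, pending)):
        do_something(result)
        mark_done(url)  # recorded per URL, so a crash only repeats the in-flight ones

Because progress is recorded per URL rather than as a single counter, restarting after a failure repeats at most the handful of URLs that were in flight when the crash happened.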
