
Python multiprocessing.Pool & memory

I'm using Pool.map for a scoring procedure:

  1. "cursor" with millions of arrays from a data source 具有数据源中数百万个数组的“游标”
  2. calculation 计算
  3. save the result in a data sink 将结果保存到数据接收器

The results are independent.

I'm just wondering if I can avoid the memory demand. At first it seems that every array goes into Python, and only then do steps 2 and 3 proceed. In any case I do see a speed improvement.

from multiprocessing import Pool  # the pymongo setup is elided here

# data source and sink are in MongoDB
def scoring(some_arguments):
    ### some stuff and finally persist ###
    collection.update({uid: _uid}, {'$set': res_profile}, upsert=True)


cursor = tracking.find(timeout=False)
score_proc_pool = Pool(options.cores)
# finally I use a wrapper so I have only the document as input for map
score_proc_pool.map(scoring_wrapper, cursor, chunksize=10000)

Am I doing something wrong, or is there a better way to do this in Python?

The map functions of a Pool internally convert the iterable to a list if it doesn't have a __len__ attribute. The relevant code is in Pool.map_async, as that is used by Pool.map (and starmap) to produce the result - which is also a list.
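A minimal sketch of that behavior, using a trivial square worker that is not part of the question: a generator has no __len__, so Pool.map materializes the whole input before dispatching any work, and the return value is also a full list.

if __name__ == '__main__':
    from multiprocessing import Pool

    def square(x):
        return x * x

    gen = (i for i in range(100_000))
    print(hasattr(gen, '__len__'))  # False: Pool.map will call list(gen) first

    with Pool(2) as pool:
        # The entire generator is pulled into a list before any chunk is sent
        # to a worker process, and the results come back as one big list too.
        results = pool.map(square, gen)
        print(len(results))  # 100000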

If you don't want to read all the data into memory first, you should use Pool.imap or Pool.imap_unordered, which produce an iterator that yields the results as they come in.
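A minimal sketch of the lazy alternative, reusing the same hypothetical square worker:

if __name__ == '__main__':
    from multiprocessing import Pool

    def square(x):
        return x * x

    gen = (i for i in range(100_000))

    with Pool(2) as pool:
        # imap_unordered consumes the generator lazily, `chunksize` items at a
        # time, and yields results as workers finish them, so neither the input
        # nor the output is ever held in memory as one big list.
        total = 0
        for result in pool.imap_unordered(square, gen, chunksize=1000):
            total += result
        print(total)

Applied to the question's code, that would mean iterating over score_proc_pool.imap_unordered(scoring_wrapper, cursor, chunksize=10000) and simply draining the iterator, since each result is persisted inside scoring() anyway.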
