Python: efficient parallel operations using a dict

Question

First of all, sorry for my imperfect English.

I think my problem is easy to explain.

result = {}
list_tuple = [(float, float, float), (float, float, float), (float, float, float), ...]  # 200k tuples
threshold = [float, float, float, ...]  # max 1k values
for tuple in list_tuple:
    for value in threshold:
        # keep the tuple if the threshold lies strictly between its min and max
        if max(tuple) > value and min(tuple) < value:
            if value in result:
                result[value].append(tuple)
            else:
                result[value] = []
                result[value].append(tuple)

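list_tuple contains about 200k tuples, and I have to perform this operation very quickly (at most 2 or 3 seconds on a normal PC).

My first attempt was to do this in cython with prange() (so that I could benefit from both cython optimization and parallel execution), but the problem is (as always) the GIL: inside prange() I can handle the list and the tuples using cython memoryviews, but I cannot insert my results into a dict.

In cython I also tried using the C++ std unordered_map, but now the problem is that I cannot create a vector of arrays in C++ (which would be the value type of my dict).

The second problem is similar: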

list_tuple = [((float, float), (float, float)), ((float, float), (float, float)), ...]  # 200k tuples of tuples

result = {list_tuple[0][0]: []}

for tuple in list_tuple:
    if tuple[0] in result:
        result[tuple[0]].append(tuple)
    else:
        # note: as written, the first tuple seen for a new key is not stored
        result[tuple[0]] = []

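Here I have another problem as well: if I want to use prange(), I have to use a custom hash function in order to use an array as the key of a C++ unordered_map.

As you can see, my snippets are very simple and could be run in parallel.

I thought about trying numba, but it would probably be the same because of the GIL, and I prefer cython because I need binary code (this library could become part of commercial software, so only binary libraries are allowed).

In general I would like to avoid C/C++ functions. What I hope to find is a way to handle things like dicts and lists in parallel while staying in the Python domain and keeping as much of cython's performance as possible; but I am open to any suggestion.

Thanks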

Answer 1

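Edit

Since this approach essentially performs an outer product between the data samples and the threshold values, it significantly increases the memory required, which may be undesirable. An improved approach can be found here. I am keeping this answer for future reference, since it is referred to in this answer.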
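To make the memory cost concrete, here is a rough back-of-the-envelope sketch for the sizes mentioned in the question (about 200k samples and up to 1k thresholds). The helper name mask_nbytes is hypothetical and only used for illustration; the estimate covers a single boolean array and ignores the Python lists built afterwards.

```python
import numpy as np


def mask_nbytes(n_samples, n_thresh):
    """Bytes needed for one boolean array of shape (n_samples, n_thresh)."""
    return n_samples * n_thresh * np.dtype(bool).itemsize


# Sizes taken from the question: ~200k tuples, at most 1k thresholds.
one_mask = mask_nbytes(200_000, 1_000)
print(f"{one_mask / 1e6:.0f} MB per boolean array")
```

An expression of the form (a < t) & (b > t) creates two comparison results plus the combined mask, so up to three arrays of this size may be alive at the same moment.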

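I measured a performance increase of about 20x compared to the OP's code.

Here is an example using numpy. The data is vectorized and so are the operations. Note that the resulting dict contains empty lists, unlike the OP's example, so an additional cleanup step may be required (if applicable).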

import numpy as np

# Data setup
data = np.random.uniform(size=(200000, 3))
thresh = np.random.uniform(size=1000)

# Compute tuples for thresholds.
condition = (
    (data.min(axis=1)[:, None] < thresh)
    & (data.max(axis=1)[:, None] > thresh)
)
result = {v: data[c].tolist() for c, v in zip(condition.T, thresh)}
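The cleanup step mentioned above could look roughly like the following sketch. It uses a tiny hand-written data set (an assumption made purely for illustration) so that one threshold matches nothing, and then filters out the empty lists with a dict comprehension.

```python
import numpy as np

# Tiny illustrative data set; 0.05 lies below every row's minimum.
data = np.array([[0.1, 0.2, 0.3], [0.4, 0.6, 0.8], [0.7, 0.8, 0.9]])
thresh = np.array([0.05, 0.5, 0.75])

condition = (
    (data.min(axis=1)[:, None] < thresh)
    & (data.max(axis=1)[:, None] > thresh)
)
result = {v: data[c].tolist() for c, v in zip(condition.T, thresh)}

# Drop thresholds that matched no sample, mirroring the OP's dict.
cleaned = {v: rows for v, rows in result.items() if rows}
```

After this step, cleaned only has keys for thresholds that lie strictly between the minimum and maximum of at least one row.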

Answer 2

@a_guest's code:

import numpy as np

def foo1(data, thresh):
    data = np.asarray(data)
    thresh = np.asarray(thresh)
    # boolean mask of shape (n_samples, n_thresh)
    condition = (
        (data.min(axis=1)[:, None] < thresh)
        & (data.max(axis=1)[:, None] > thresh)
    )
    result = {v: data[c].tolist() for c, v in zip(condition.T, thresh)}
    return result

This code creates a dictionary entry once for each item in thresh.

The OP's code, simplified a bit with defaultdict (from collections):

from collections import defaultdict

def foo3(list_tuple, threshold):
    result = defaultdict(list)  # missing keys start out as empty lists
    for tuple in list_tuple:
        for value in threshold:
            if max(tuple) > value and min(tuple) < value:
                result[value].append(tuple)
    return result

This one updates a dictionary entry once for each item that meets the condition.

And with his sample data:

In [27]: foo1(data,thresh)
Out[27]: {0: [], 1: [[0, 1, 2]], 2: [], 3: [], 4: [[3, 4, 5]]}
In [28]: foo3(data.tolist(), thresh.tolist())
Out[28]: defaultdict(list, {1: [[0, 1, 2]], 4: [[3, 4, 5]]})

Timing tests:

In [29]: timeit foo1(data,thresh)
66.1 µs ± 197 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# In [30]: timeit foo3(data,thresh)
# 161 µs ± 242 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [31]: timeit foo3(data.tolist(),thresh.tolist())
30.8 µs ± 56.4 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

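Iterating over an array is slower than iterating over a list. The time for tolist() is small; np.asarray on a list takes longer.

With a larger data sample, the array version is faster: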

In [42]: data = np.random.randint(0,50,(3000,3))
    ...: thresh = np.arange(50)
In [43]: 
In [43]: timeit foo1(data,thresh)
16 ms ± 391 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [44]: %%timeit x,y = data.tolist(), thresh.tolist() 
    ...: foo3(x,y)
    ...: 
83.6 ms ± 68.6 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
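For readers who want to repeat the comparison outside IPython, the following is a minimal sketch using the standard timeit module. The seed, the reduced sample size of 300 rows, and the repeat count are arbitrary assumptions, and the timings will differ from machine to machine. It also checks that both functions agree once the empty lists produced by foo1 are dropped.

```python
import timeit
from collections import defaultdict

import numpy as np


def foo1(data, thresh):
    data = np.asarray(data)
    thresh = np.asarray(thresh)
    condition = (
        (data.min(axis=1)[:, None] < thresh)
        & (data.max(axis=1)[:, None] > thresh)
    )
    return {v: data[c].tolist() for c, v in zip(condition.T, thresh.tolist())}


def foo3(list_tuple, threshold):
    result = defaultdict(list)
    for tup in list_tuple:
        for value in threshold:
            if max(tup) > value and min(tup) < value:
                result[value].append(tup)
    return result


rng = np.random.default_rng(0)  # seed chosen arbitrarily
data = rng.integers(0, 50, size=(300, 3))  # smaller than the 3000 rows above
thresh = np.arange(50)
x, y = data.tolist(), thresh.tolist()

# Both versions hold the same content once foo1's empty lists are dropped.
non_empty = {k: v for k, v in foo1(data, thresh).items() if v}
same = non_empty == dict(foo3(x, y))

# Average wall-clock time per call over a handful of runs.
t1 = timeit.timeit(lambda: foo1(data, thresh), number=5) / 5
t3 = timeit.timeit(lambda: foo3(x, y), number=5) / 5
print(f"foo1: {t1 * 1e3:.2f} ms, foo3: {t3 * 1e3:.2f} ms, same result: {same}")
```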

Answer 3

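Several performance improvements are possible by using numpy's vectorization capabilities:

1. Currently, min and max are recomputed for each threshold. Instead, they can be precomputed once and then reused for each threshold.
2. The loop over the data samples (list_tuple) is performed in pure Python. This loop can be vectorized using numpy.

In the following tests I used data.shape == (200000, 3); thresh.shape == (1000,) as indicated in the OP. I also omitted the modification of the result dict, since depending on the data this could quickly overflow memory.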

Applying 1.

v_min = [min(t) for t in data]
v_max = [max(t) for t in data]
for mi, ma in zip(v_min, v_max):
    for value in thresh:
        if ma > value and mi < value:
            pass

This gives a performance increase of about 5x compared to the OP's code.

Applying 1. & 2.

v_min = data.min(axis=1)
v_max = data.max(axis=1)
mask = np.empty(shape=(data.shape[0],), dtype=bool)
for t in thresh:
    mask[:] = (v_min < t) & (v_max > t)
    samples = data[mask]
    if samples.size > 0:
        pass

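This gives a performance increase of about 30x compared to the OP's code. A further benefit of this approach is that it does not involve incremental appends to the lists, which can slow the program down because memory may have to be reallocated. Instead, it creates each list (one per threshold) in a single step.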
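The timings above leave out the dict update. If the result dict is needed, one way to fill it with the same per-threshold loop might look like the sketch below. The function name group_by_threshold and the small sample data are assumptions for illustration only, and the memory caveat mentioned earlier still applies for large inputs.

```python
import numpy as np


def group_by_threshold(data, thresh):
    """Map each threshold to the rows whose min < threshold < max."""
    data = np.asarray(data)
    v_min = data.min(axis=1)  # precomputed once (improvement 1)
    v_max = data.max(axis=1)
    result = {}
    for t in np.asarray(thresh).tolist():
        mask = (v_min < t) & (v_max > t)  # one 1-D mask at a time
        if mask.any():
            result[t] = data[mask].tolist()  # each list is built in one step
    return result


data = np.array([[0.1, 0.2, 0.3], [0.4, 0.6, 0.8], [0.7, 0.8, 0.9]])
thresh = np.array([0.05, 0.5, 0.75])
result = group_by_threshold(data, thresh)
```

Because only non-empty selections are stored, the output has the same shape as the OP's dict and needs no separate cleanup pass.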

Answer 1: 0, 2018-07-27 13:47:00
Answer 2: 0, 2018-07-27 22:46:52
Answer 3: 0, accepted, 2018-07-30 00:00:46
