Efficiently create a list of Python dictionaries by combining a numpy array and a list of tuples

Question

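I am trying to find a time- and memory-efficient way to create dictionaries by combining a list of tuples and a one-dimensional numpy array. Each dictionary should have the following structure: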

{"id": <first elem from tuple>, "name": <second elem from tuple>, "text": <third elem from tuple>, "value": <elem from array>}

The list of tuples looks like this:

list_of_tuples = [("id1", "name1", "text1"), ("id2", "name2", "text2")]

In addition, the numpy array has the same number of elements as the list and contains elements of type np.float16:

value_array = np.array([0.42, np.nan], dtype=np.float16)

All NaN values should also be filtered out. The result for the example above should be:

{"id": "id1", "name": "name1", "text": "text1", "value": 0.42}

I did get it working like this:

[
    dict(
        dict(zip(["id", "name", "text"], list_of_tuples[index])),
        **{"value": value},
    )
    for index, value in enumerate(value_array)
    if not (math.isnan(value))
]

However, this is very slow for many entries, and using an index to fetch entries from the list feels wrong/inefficient.

Answer 1

You can definitely do this without explicitly indexing inside the loop. That should improve performance.

# Positions of the non-NaN values, as a flat 1-D index array
value_array_indices = np.flatnonzero(~np.isnan(value_array))
list_of_tuples = np.array(list_of_tuples)[value_array_indices]
value_array = value_array[value_array_indices]
[{"id": x[0], "name": x[1], "text": x[2], "value": v} for x, v in zip(list_of_tuples, value_array)]
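A minimal self-contained sketch of the same filtering idea, assuming the sample data from the question. It swaps the index array for a boolean mask and uses itertools.compress and tolist(); both are my additions rather than part of the answer above. compress() keeps the tuples as plain Python tuples, and tolist() converts the NumPy scalars into plain Python floats.

```python
from itertools import compress

import numpy as np

list_of_tuples = [("id1", "name1", "text1"), ("id2", "name2", "text2")]
value_array = np.array([0.42, np.nan], dtype=np.float16)

# True where the value is a real number, False where it is NaN.
mask = ~np.isnan(value_array)

# compress() keeps only the tuples whose mask entry is True, so the
# tuples are never converted into a NumPy string array.
kept_tuples = compress(list_of_tuples, mask.tolist())

# Boolean indexing drops the NaN entries; tolist() yields Python floats.
kept_values = value_array[mask].tolist()

result = [
    {"id": t[0], "name": t[1], "text": t[2], "value": v}
    for t, v in zip(kept_tuples, kept_values)
]
print(result)
```

Note that a float16 cannot represent 0.42 exactly, so the stored value is only close to 0.42 once it is converted to a Python float.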

Answer 2

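It looks like someone posted a similar solution while I was writing this, but I will post it anyway for the sake of the measured timings and a few words of explanation.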

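Using the code suggested below, with a test input of length one million (including a single NaN), I see the run time go down to less than 30% of that of the code in the question.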

Time: 0.3486933708190918
{'id': 'id0', 'name': 'name0', 'text': 'text0', 'value': 0.0} {'id': 'id999999', 'name': 'name999999', 'text': 'text999999', 'value': 999999.0} 999999
Time: 1.2175893783569336
{'id': 'id0', 'name': 'name0', 'text': 'text0', 'value': 0.0} {'id': 'id999999', 'name': 'name999999', 'text': 'text999999', 'value': 999999.0} 999999

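I think part of the difference here comes from not having to index into the list of tuples, but I suspect a large part of it comes from not having to instantiate a zip object for every element. You are handling a small number of dictionary keys with the same names every time, so you do not really need the flexibility that zip provides here, and creating the dictionary from a simple explicit expression is more direct.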

(The zip(list_of_tuples, value_array) call obviously creates only a single zip object for the whole operation, so it is not significant.)
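The per-record cost described above can be isolated with a small timeit comparison. This is only a sketch: the function names with_zip and with_literal are hypothetical, and the key list is hoisted into a variable for brevity, whereas the code in the question rebuilds it inline on every iteration.

```python
import timeit

t = ("id1", "name1", "text1")
v = 0.42
keys = ["id", "name", "text"]


def with_zip():
    # Builds a zip object and two dicts for a single record.
    return dict(dict(zip(keys, t)), **{"value": v})


def with_literal():
    # Builds the record directly from an explicit expression.
    return {"id": t[0], "name": t[1], "text": t[2], "value": v}


# Both constructions produce the same dictionary.
same = with_zip() == with_literal()

# Timings depend on the machine and Python version, so none are quoted here.
zip_time = timeit.timeit(with_zip, number=100_000)
literal_time = timeit.timeit(with_literal, number=100_000)
print(same, zip_time, literal_time)
```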

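I would also suggest using from math import isnan here rather than doing an attribute lookup to get math.isnan every time, although the difference turned out to be relatively unimportant.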

from math import isnan
import numpy as np
import time

# construct some test data
n = 1000000
value_array = np.arange(n, dtype=np.float64)  # the np.float alias was removed in NumPy 1.24
value_array[n // 2] = np.nan
list_of_tuples = [(f"id{i}", f"name{i}", f"text{i}")
                  for i in range(len(value_array))]

# timings for suggested alternative method
t0 = time.time()
l = [{"id": t[0],
      "name": t[1],
      "text": t[2],
      "value": v}
     for t, v in zip(list_of_tuples, value_array) if not isnan(v)]
t1 = time.time()
print("Time:", t1 - t0)
print(l[0], l[-1], len(l))

# timings for the method in the question
t0 = time.time()
l = \
[
    dict(
        dict(zip(["id", "name", "text"], list_of_tuples[index])),
        **{"value": value},
    )
    for index, value in enumerate(value_array)
    if not (isnan(value))
]
t1 = time.time()
print("Time:", t1 - t0)
print(l[0], l[-1], len(l))

Also tried and rejected: creating a boolean array of the "not isnan" values, using

not_isnan_array = np.logical_not(np.isnan(value_array))

and then in the list comprehension you can do:

... for t, v, not_isnan in zip(list_of_tuples, value_array, not_isnan_array) if not_isnan

But it makes almost no difference to the timing, so it does not justify the extra memory use.
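For reference, the rejected variant might look like this when written out in full. This is a sketch on a smaller test input than the one timed above; the names extra_bytes and records are my own. The nbytes attribute shows the additional memory held by the boolean array, which is one byte per element.

```python
import numpy as np

# Smaller test input than the one-million-element run, for illustration.
n = 1000
value_array = np.arange(n, dtype=np.float64)
value_array[n // 2] = np.nan
list_of_tuples = [(f"id{i}", f"name{i}", f"text{i}") for i in range(n)]

# Precompute which entries are not NaN in a single vectorized pass.
not_isnan_array = np.logical_not(np.isnan(value_array))

# The boolean array stores one byte per element on top of the inputs.
extra_bytes = not_isnan_array.nbytes

records = [
    {"id": t[0], "name": t[1], "text": t[2], "value": v}
    for t, v, not_isnan in zip(list_of_tuples, value_array, not_isnan_array)
    if not_isnan
]
print(extra_bytes, len(records))
```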

Update

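Further experimentation with hybrid versions (between the original in the question and the suggested alternative) shows that, as I suspected, most of the difference comes from avoiding the creation of a zip object on each iteration. Avoiding explicit indexing of the list of tuples accounts for only a minor part of the speedup.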
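The hybrid versions themselves are not shown in the answer. As an assumption about what they might look like, the sketch below changes one factor at a time: hybrid_a keeps the explicit indexing but drops the per-element zip, and hybrid_b drops the indexing but keeps the per-element zip. Both names are hypothetical.

```python
from math import isnan

import numpy as np

n = 1000
value_array = np.arange(n, dtype=np.float64)
value_array[n // 2] = np.nan
list_of_tuples = [(f"id{i}", f"name{i}", f"text{i}") for i in range(n)]

# Hybrid A: explicit indexing into the list, but a literal dict per record.
hybrid_a = [
    {
        "id": list_of_tuples[index][0],
        "name": list_of_tuples[index][1],
        "text": list_of_tuples[index][2],
        "value": value,
    }
    for index, value in enumerate(value_array)
    if not isnan(value)
]

# Hybrid B: no indexing (tuples and values are paired by one outer zip),
# but still a fresh zip object and two dicts per record.
hybrid_b = [
    dict(dict(zip(["id", "name", "text"], t)), **{"value": v})
    for t, v in zip(list_of_tuples, value_array)
    if not isnan(v)
]

print(hybrid_a == hybrid_b, len(hybrid_a))
```

Timing each list comprehension separately, for example with time.time() as in the code above, would show how much each change contributes on a given machine.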

Answer 1
1 accepted 2020-07-02 19:15:36

Answer 2
1 2020-07-02 20:10:06
