How can I extract values from Python lists or NumPy arrays based on a condition, using NumPy vectorization?

Question

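I have the following code, which extracts values from several lists depending on a given condition. My dataset is large, though: each list holds 1 million values, so this nested-loop approach takes far too long. Is there a vectorized or otherwise faster way, using NumPy, to speed up my code and use less memory?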

import random
import numpy as np

x=[random.randrange(0,10) for _ in range(0,100)]
y=[random.randrange(0,10) for _ in range(0,100)]
z=[random.randrange(0,10) for _ in range(0,100)]

x_unique=np.unique(x)

xx_list=[]
y_list=[]
z_list=[]

for i in range(len(x_unique)):
    xx_list.append([])
    y_list.append([])
    z_list.append([])

for ii, i in enumerate(x_unique):
    for j, k in enumerate(x):
        if i == k:
            xx_list[ii].append(x[j])
            y_list[ii].append(y[j])
            z_list[ii].append(z[j])

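[Edit: added an example of the expected output]

In y_list and z_list, I want to store the values that sit at the same index positions as the values stored in xx_list.

For example, consider the following sample lists: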

x = [0.1,0.1,1,0.1,2,1,0.1]
y = [1.1,2.1,3,4,5,6,7]
z = [10,11,12,13.1,14,15,16]

x_unique is therefore:

x_unique = [0.1,1,2]

xx_list, y_list and z_list should then contain the following:

xx_list = [[0.1,0.1,0.1,0.1],[1,1],[2]]
y_list = [[1.1,2.1,4,7],[3,6],[5]]
z_list = [[10,11,13.1,16],[12,15],[14]]

Answer 1

I found a solution that takes about 400 ms to process Python lists of 1 million items, and another that takes about 100 ms when working with NumPy arrays.

Python

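The strategy I use is to build one dictionary per input list (x, y, z). Each dictionary acts as a set of labeled bins. For each input list, bin i holds the items whose corresponding item in x is the i-th unique value of x (in sorted order). "Corresponding" means the items sit at the same position in their respective lists.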

from collections import defaultdict

def compute_bins(x, y, z):
    # You can see this as an ordered-set:
    x_bin_indexes = {a:i for i, a in enumerate(sorted(set(x)))}

    # Each input list has its own set of labeled bins: 
    x_bins = defaultdict(list)
    y_bins = defaultdict(list)
    z_bins = defaultdict(list)

    for item_x, item_y, item_z in zip(x, y, z):
        index = x_bin_indexes[item_x]
        # Drop the item in the corresponding bin:
        x_bins[index].append(item_x)
        y_bins[index].append(item_y)
        z_bins[index].append(item_z)

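    # Now we transform the result back to lists of lists, reading the bins
    # in index order so they line up with the sorted unique values of x:
    n_bins = len(x_bin_indexes)
    x_bins = [x_bins[i] for i in range(n_bins)]
    y_bins = [y_bins[i] for i in range(n_bins)]
    z_bins = [z_bins[i] for i in range(n_bins)]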
    return x_bins, y_bins, z_bins

The key point here is that every operation inside the loop runs in (amortized) constant time. The function can be called like this:

>>> xx_list, y_list, z_list = compute_bins(x, y, z)
>>> xx_list
[[0, 0, 0, 0], [1, 1], [2]]
>>> y_list
[[1, 2, 4, 7], [3, 6], [5]]
>>> z_list
[[10, 11, 13, 16], [12, 15], [14]]
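
As a quick sanity check, here is a minimal sketch of the same binning idea run on the sample lists from the question (the ones containing floats). The helper name group_by_first is made up for this illustration, and reading the bins back in sorted-key order is an assumption about the ordering you want:

```python
from collections import defaultdict

def group_by_first(x, *others):
    """Group x and every list in `others` by the sorted unique values of x."""
    # Map each distinct value of x to its rank among the sorted unique values.
    rank = {value: i for i, value in enumerate(sorted(set(x)))}
    # One dict of bins per input list (x itself comes first).
    all_bins = [defaultdict(list) for _ in range(1 + len(others))]
    for row in zip(x, *others):
        index = rank[row[0]]
        # Drop every item of this row into the bin chosen by its x value.
        for bins, item in zip(all_bins, row):
            bins[index].append(item)
    # Read the bins back in rank order so they line up with the unique values.
    return [[bins[i] for i in range(len(rank))] for bins in all_bins]

x = [0.1, 0.1, 1, 0.1, 2, 1, 0.1]
y = [1.1, 2.1, 3, 4, 5, 6, 7]
z = [10, 11, 12, 13.1, 14, 15, 16]

xx_list, y_list, z_list = group_by_first(x, y, z)
print(xx_list)  # [[0.1, 0.1, 0.1, 0.1], [1, 1], [2]]
print(y_list)   # [[1.1, 2.1, 4, 7], [3, 6], [5]]
print(z_list)   # [[10, 11, 13.1, 16], [12, 15], [14]]
```

Floats work as dictionary keys here because equal values in x compare exactly equal; values that differ only by rounding error would land in separate bins.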

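NumPy

With NumPy I came up with a different strategy: sort the items of all the arrays according to x, then split them according to the number of consecutive identical values in x. Here is the code (assuming x, y and z are NumPy arrays):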

import numpy as np

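def compute_bins(x, *others):
    # Sorted unique values of x and how many times each one occurs:
    x_bin_indexes, bin_sizes = np.unique(x, return_counts=True)
    # Permutation that sorts x, so equal values end up next to each other:
    sort_order = np.argsort(x)
    # Positions at which to cut the sorted arrays (end of every bin but the last):
    split_rule = np.cumsum(bin_sizes)[:-1]
    # Apply the same reordering and the same cuts to x and to every other array:
    return tuple(np.split(o[sort_order], split_rule) for o in (x, *others))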

Note that np.cumsum(bin_sizes)[:-1] is needed only because split expects a list of indices, not a list of chunk sizes. If we want to split [0, 0, 0, 1, 1, 2] into [[0, 0, 0], [1, 1], [2]], we pass [3, 5] to np.split rather than [3, 2, 1].
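
To make the split rule concrete, here is a small sketch using exactly the numbers from the paragraph above (the variable names values and parts are just for illustration):

```python
import numpy as np

values = np.array([0, 0, 0, 1, 1, 2])
bin_sizes = np.array([3, 2, 1])

# cumsum gives the end position of every bin: [3, 5, 6].
# The last entry is simply the array length, so it is dropped.
split_rule = np.cumsum(bin_sizes)[:-1]
parts = np.split(values, split_rule)

print(split_rule.tolist())           # [3, 5]
print([p.tolist() for p in parts])   # [[0, 0, 0], [1, 1], [2]]
```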

Performance

As for performance, here is how it runs on my machine:

from random import randint

test_size = int(1e6)
x = [randint(0, 100) for _ in range(test_size)]
y = [i+1 for i in range(test_size)]
z = [i+test_size+1 for i in range(test_size)]

%timeit xx_list, y_list, z_list = compute_bins(x, y, z)

Output for the pure Python version:

396 ms ± 5.98 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Output for the NumPy version (x, y and z are np.array):

105 ms ± 1.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

For comparison, the solution you originally proposed gives:

19.7 s ± 282 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
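
One caveat on the NumPy version: the default sort used by np.argsort is not guaranteed to be stable, so items inside a bin may not keep their original relative order. If that order matters (as it does in the expected output from the question), a possible variant is to request a stable sort. The sketch below is an assumption-labeled variation, not the version that was timed above, and the name compute_bins_stable is made up for this example:

```python
import numpy as np

def compute_bins_stable(x, *others):
    # Hypothetical variant of compute_bins: same idea, but with a stable sort
    # so that equal x values keep their original relative order.
    unique_values, bin_sizes = np.unique(x, return_counts=True)
    sort_order = np.argsort(x, kind="stable")
    split_rule = np.cumsum(bin_sizes)[:-1]
    return tuple(np.split(o[sort_order], split_rule) for o in (x, *others))

# Sample data from the question, as NumPy arrays.
x = np.array([0.1, 0.1, 1, 0.1, 2, 1, 0.1])
y = np.array([1.1, 2.1, 3, 4, 5, 6, 7])
z = np.array([10, 11, 12, 13.1, 14, 15, 16])

xx_list, y_list, z_list = compute_bins_stable(x, y, z)
print([b.tolist() for b in y_list])  # [[1.1, 2.1, 4.0, 7.0], [3.0, 6.0], [5.0]]
print([b.tolist() for b in z_list])  # [[10.0, 11.0, 13.1, 16.0], [12.0, 15.0], [14.0]]
```

A stable sort may be somewhat slower than the default, so it is worth re-running the timing on your own data before adopting it.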

Answer 1: accepted, 2019-03-31 17:39:56
