Fastest way to iterate a numpy array and update each element

Question

This might look odd to you, but I happen to have an odd goal to achieve. The code is as follows.

import numpy as np

# A is a numpy array, dtype=int32.
# Each element is an ID (int). The ID range may be wide,
# but far fewer distinct values actually occur than the range allows.
A = np.array([[379621, 552965, 192509],
              [509849, 252786, 710979],
              [379621, 718598, 591201],
              [509849,  35700, 951719]])

# and I need to map these sparse ID to dense ones,
# my idea is to have a dict, mapping actual_sparse_ID -> dense_ID
M = {}

# so I iterate this numpy array, and check if this sparse ID has a dense one or not
for i in np.nditer(A, op_flags=['readwrite']):
    if i not in M:
        M[i] = len(M)  # sparse ID got a dense one
    i[...] = M[i]   # replace sparse one with the dense ID

My goal could be achieved with np.unique(A, return_inverse=True): the return_inverse output is exactly what I want.
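
For reference, a minimal sketch of what return_inverse yields on the small array above. The explicit dtype and the reshape are assumptions added here: the reshape only ensures the result has the same layout as A, whatever shape the installed NumPy version returns for the inverse.

```python
import numpy as np

A = np.array([[379621, 552965, 192509],
              [509849, 252786, 710979],
              [379621, 718598, 591201],
              [509849,  35700, 951719]], dtype=np.int32)

# uniq holds the sorted distinct IDs; inv holds, for each element of A,
# the position of that element's ID inside uniq (a dense ID).
uniq, inv = np.unique(A, return_inverse=True)
dense = inv.reshape(A.shape)  # same layout as A

# Indexing uniq with the dense IDs reconstructs the original array.
assert np.array_equal(uniq[dense], A)
```

Note that np.unique assigns dense IDs in sorted order of the sparse IDs, whereas the dict loop above assigns them in first-seen order. Both give a valid sparse-to-dense mapping, but the numbers differ.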

However, the numpy array I have is too large to be loaded into memory in full, so I cannot run np.unique over the whole data. That is why I came up with this dict-mapping idea...

Is this the right way to go? Are there any possible improvements?

Answer 1

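I will try to offer an alternative approach that uses numpy.unique() on sub-arrays. This solution has not been fully tested. I also did not do any side-by-side performance comparison, because your solution does not fully work for me.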

Let's say we have an array c that we split into two smaller arrays. Let's create some test data, for example:

>>> a = np.array([[1,1,2,3,4],[1,2,6,6,2],[8,0,1,1,4]])
>>> b = np.array([[11,2,-1,12,6],[12,2,6,11,2],[7,0,3,1,3]])
>>> c = np.vstack([a, b])
>>> print(c)
[[ 1  1  2  3  4]
 [ 1  2  6  6  2]
 [ 8  0  1  1  4]
 [11  2 -1 12  6]
 [12  2  6 11  2]
 [ 7  0  3  1  3]]

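Here we assume that c is the large array and that a and b are its sub-arrays. Of course, one could instead build c first and then extract the sub-arrays.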

The next step is to run numpy.unique() on the two sub-arrays:

>>> ua, ia = np.unique(a, return_inverse=True)
>>> ub, ib = np.unique(b, return_inverse=True)
>>> uc, ic = np.unique(c, return_inverse=True) # this is for future reference

Now, here is an algorithm for combining the results from the sub-arrays:

def merge_unique(ua, ia, ub, ib):
    # make copies *if* changing inputs is undesirable:
    ua = ua.copy()
    ia = ia.copy()
    ub = ub.copy()
    ib = ib.copy()

    # find differences between unique values in the two arrays:
    diffab = np.setdiff1d(ua, ub, assume_unique=True)
    diffba = np.setdiff1d(ub, ua, assume_unique=True)

    # find indices in ua, ub where to insert "other" unique values:
    ssa = np.searchsorted(ua, diffba)
    ssb = np.searchsorted(ub, diffab)

    # drop insert positions at the very end: they shift no existing index
    ssa = ssa[ssa < len(ua)]
    ssb = ssb[ssb < len(ub)]

    # increment indices past previously computed "insert" positions:
    for v in ssa[::-1]:
        ia[ia >= v] += 1
    for v in ssb[::-1]:
        ib[ib >= v] += 1

    # combine results:
    uc = np.union1d(ua, ub) # or use ssa, ssb, diffba, diffab to update ua, ub
    ic = np.concatenate([ia, ib])
    return uc, ic
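
One possible alternative, which is not part of the original answer: since every value of ua and ub is present in their sorted union, np.searchsorted gives the new position of each old unique value directly, and the old indices can be remapped without a Python loop. The function name merge_unique_searchsorted is hypothetical, and the .ravel() calls are an assumption added so that the sketch does not depend on the shape the installed NumPy version returns for the inverse.

```python
import numpy as np

def merge_unique_searchsorted(ua, ia, ub, ib):
    """Hypothetical variant of merge_unique (name chosen for this sketch)."""
    # Sorted union of both sets of unique values.
    uc = np.union1d(ua, ub)
    # Position of each old unique value inside the merged array. Every
    # value of ua and ub occurs in uc, so these are exact positions.
    pos_a = np.searchsorted(uc, ua)
    pos_b = np.searchsorted(uc, ub)
    # Translate the old inverse indices into indices into uc.
    ic = np.concatenate([pos_a[ia], pos_b[ib]])
    return uc, ic

a = np.array([[1, 1, 2, 3, 4], [1, 2, 6, 6, 2], [8, 0, 1, 1, 4]])
b = np.array([[11, 2, -1, 12, 6], [12, 2, 6, 11, 2], [7, 0, 3, 1, 3]])
ua, ia = np.unique(a, return_inverse=True)
ub, ib = np.unique(b, return_inverse=True)
uc, ic = merge_unique_searchsorted(ua, ia.ravel(), ub, ib.ravel())

# Compare against np.unique on the full array.
ref_u, ref_i = np.unique(np.vstack([a, b]), return_inverse=True)
assert np.array_equal(uc, ref_u)
assert np.array_equal(ic, ref_i.ravel())
```

This replaces the per-insert-position loop with two fancy-indexing operations, which may matter when many new unique values appear in each sub-array. It has not been benchmarked here.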

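Now, let's run this function on the results of numpy.unique() from the sub-arrays, and then compare the merged indices and unique values with the reference results uc and ic: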

>>> uc2, ic2 = merge_unique(ua, ia, ub, ib)
>>> np.all(uc2 == uc)
True
>>> np.all(ic2 == ic)
True

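Splitting into more than two sub-arrays can be handled with very little extra work: just keep accumulating the "unique" values and indices, like this: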

uacc, iacc = np.unique(subarr1, return_inverse=True)
ui, ii = np.unique(subarr2, return_inverse=True)
uacc, iacc = merge_unique(uacc, iacc, ui, ii)
ui, ii = np.unique(subarr3, return_inverse=True)
uacc, iacc = merge_unique(uacc, iacc, ui, ii)
ui, ii = np.unique(subarr4, return_inverse=True)
uacc, iacc = merge_unique(uacc, iacc, ui, ii)
................................ (etc.)
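
The accumulation above can also be written as a loop. The following is only a sketch: chunks is a hypothetical stand-in for however the sub-arrays are actually produced (here, small random arrays), merge_unique is reproduced in condensed form from the function above so that the block runs on its own, and the .ravel() call is an assumption added so that the indices are one-dimensional whatever the NumPy version.

```python
import numpy as np

def merge_unique(ua, ia, ub, ib):
    # Same steps as merge_unique above, working on copies of the indices.
    ia, ib = ia.copy(), ib.copy()
    diffab = np.setdiff1d(ua, ub, assume_unique=True)
    diffba = np.setdiff1d(ub, ua, assume_unique=True)
    ssa = np.searchsorted(ua, diffba)
    ssb = np.searchsorted(ub, diffab)
    ssa = ssa[ssa < len(ua)]
    ssb = ssb[ssb < len(ub)]
    for v in ssa[::-1]:
        ia[ia >= v] += 1
    for v in ssb[::-1]:
        ib[ib >= v] += 1
    return np.union1d(ua, ub), np.concatenate([ia, ib])

# Hypothetical sub-arrays; in practice these would be read piece by piece.
rng = np.random.default_rng(0)
chunks = [rng.integers(0, 50, size=(3, 5)) for _ in range(4)]

uacc, iacc = None, None
for chunk in chunks:
    ui, ii = np.unique(chunk, return_inverse=True)
    ii = ii.ravel()
    if uacc is None:
        uacc, iacc = ui, ii          # first sub-array: nothing to merge yet
    else:
        uacc, iacc = merge_unique(uacc, iacc, ui, ii)

# Check against np.unique on all the data at once.
ref_u, ref_i = np.unique(np.vstack(chunks), return_inverse=True)
assert np.array_equal(uacc, ref_u)
assert np.array_equal(iacc, ref_i.ravel())
```

Note that iacc still grows to the size of the full data set, so this sketch keeps all dense indices in memory. Only the np.unique calls operate on one sub-array at a time.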
