Fastest way to iterate a numpy array and update each element
This may look strange to you, but I happen to have an odd goal to achieve. The code is as follows.
# A is a numpy array, dtype=int32,
# and each element is actually an ID(int), the ID range might be wide,
# but the actually existing values are quite fewer than the dense range,
A = array([[379621, 552965, 192509],
           [509849, 252786, 710979],
           [379621, 718598, 591201],
           [509849,  35700, 951719]])
# and I need to map these sparse ID to dense ones,
# my idea is to have a dict, mapping actual_sparse_ID -> dense_ID
M = {}
# so I iterate this numpy array, and check if this sparse ID has a dense one or not
for i in np.nditer(A, op_flags=['readwrite']):
    if i not in M:
        M[i] = len(M)  # sparse ID got a dense one
    i[...] = M[i]  # replace sparse one with the dense ID
My goal could be achieved with np.unique(A, return_inverse=True), whose return_inverse result is exactly what I want.
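For reference, here is a minimal sketch of what that call returns on the small array above. The names `uniq` and `dense` are illustrative only, and the `reshape` is there because the shape of the inverse for multi-dimensional input has differed between NumPy versions.

```python
import numpy as np

A = np.array([[379621, 552965, 192509],
              [509849, 252786, 710979],
              [379621, 718598, 591201],
              [509849,  35700, 951719]], dtype=np.int32)

# uniq: sorted distinct sparse IDs
# inverse: for each element of A, its index into uniq
uniq, inverse = np.unique(A, return_inverse=True)

# reshape so the dense IDs line up with A regardless of NumPy version
dense = inverse.reshape(A.shape)

# indexing uniq with the dense IDs recovers the original sparse IDs
assert np.array_equal(uniq[dense], A)
print(dense)
```

Note that the dense IDs produced this way follow the sorted order of the sparse IDs, whereas the dict loop assigns them in order of first appearance.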
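However, the numpy array I have is too large to load fully into memory, so I cannot run np.unique on the whole data. That is why I came up with this dict-mapping idea...
Is this the right approach? Are there any possible improvements?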
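I will try to offer an alternative approach that uses numpy.unique() on sub-arrays. This solution has not been fully tested. I also did not do any side-by-side performance evaluation, because your solution did not quite work for me.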
Suppose we have an array c that we split into two smaller arrays. Let's create some test data, for example:
>>> a = np.array([[1,1,2,3,4],[1,2,6,6,2],[8,0,1,1,4]])
>>> b = np.array([[11,2,-1,12,6],[12,2,6,11,2],[7,0,3,1,3]])
>>> c = np.vstack([a, b])
>>> print(c)
[[ 1  1  2  3  4]
 [ 1  2  6  6  2]
 [ 8  0  1  1  4]
 [11  2 -1 12  6]
 [12  2  6 11  2]
 [ 7  0  3  1  3]]
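Here we assume that c is the large array and that a and b are its sub-arrays. Of course, one could build c first and then extract the sub-arrays.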
The next step is to run numpy.unique() on the two sub-arrays:
>>> ua, ia = np.unique(a, return_inverse=True)
>>> ub, ib = np.unique(b, return_inverse=True)
>>> uc, ic = np.unique(c, return_inverse=True) # this is for future reference
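To illustrate why the two per-sub-array results cannot simply be concatenated, here is a small sketch using the same test data. Each inverse array indexes into its own table of unique values, so the same index means different things in `ua` and `ub`. The variable names `same_index_same_value`, `pos_in_ua` and `pos_in_ub` are made up for this illustration.

```python
import numpy as np

a = np.array([[1, 1, 2, 3, 4], [1, 2, 6, 6, 2], [8, 0, 1, 1, 4]])
b = np.array([[11, 2, -1, 12, 6], [12, 2, 6, 11, 2], [7, 0, 3, 1, 3]])

ua, ia = np.unique(a, return_inverse=True)
ub, ib = np.unique(b, return_inverse=True)

print(ua)  # sorted distinct values of a
print(ub)  # sorted distinct values of b

# The same index points at different values in the two tables.
same_index_same_value = ua[2] == ub[2]

# The value 2 occurs in both sub-arrays but at different positions.
pos_in_ua = int(np.searchsorted(ua, 2))
pos_in_ub = int(np.searchsorted(ub, 2))
print(same_index_same_value, pos_in_ua, pos_in_ub)
```

The merge step therefore has to shift the indices of both sub-arrays so that they refer to one shared, sorted table of unique values.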
Now, here is an algorithm for combining the results of the sub-arrays:
def merge_unique(ua, ia, ub, ib):
    # make copies *if* changing inputs is undesirable:
    ua = ua.copy()
    ia = ia.copy()
    ub = ub.copy()
    ib = ib.copy()

    # find differences between unique values in the two arrays:
    diffab = np.setdiff1d(ua, ub, assume_unique=True)
    diffba = np.setdiff1d(ub, ua, assume_unique=True)

    # find indices in ua, ub where to insert "other" unique values:
    ssa = np.searchsorted(ua, diffba)
    ssb = np.searchsorted(ub, diffab)

    # throw away values that are too large:
    ssa = ssa[np.where(ssa < len(ua))]
    ssb = ssb[np.where(ssb < len(ub))]

    # increment indices past previously computed "insert" positions:
    for v in ssa[::-1]:
        ia[ia >= v] += 1
    for v in ssb[::-1]:
        ib[ib >= v] += 1

    # combine results:
    uc = np.union1d(ua, ub)  # or use ssa, ssb, diffba, diffab to update ua, ub
    ic = np.concatenate([ia, ib])
    return uc, ic
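As a possible alternative to the two Python-level loops, the same remapping can be sketched with a lookup table built by np.searchsorted against the merged unique values. This is not the code from the answer: `merge_unique_vectorized` is a hypothetical name, and the sketch assumes that `ua` and `ub` are sorted and free of duplicates, as np.unique returns them.

```python
import numpy as np

def merge_unique_vectorized(ua, ia, ub, ib):
    """Hypothetical loop-free variant of merge_unique (illustration only)."""
    # sorted union of both sets of unique values
    uc = np.union1d(ua, ub)
    # position of every old unique value inside the merged table;
    # exact because every element of ua and ub is present in uc
    map_a = np.searchsorted(uc, ua)
    map_b = np.searchsorted(uc, ub)
    # translate old indices into merged indices by fancy indexing
    ic = np.concatenate([map_a[np.ravel(ia)], map_b[np.ravel(ib)]])
    return uc, ic

a = np.array([[1, 1, 2, 3, 4], [1, 2, 6, 6, 2], [8, 0, 1, 1, 4]])
b = np.array([[11, 2, -1, 12, 6], [12, 2, 6, 11, 2], [7, 0, 3, 1, 3]])
ua, ia = np.unique(a, return_inverse=True)
ub, ib = np.unique(b, return_inverse=True)

uc2, ic2 = merge_unique_vectorized(ua, ia, ub, ib)

# compare against np.unique on the stacked array
uc, ic = np.unique(np.vstack([a, b]), return_inverse=True)
assert np.array_equal(uc2, uc)
assert np.array_equal(ic2, np.ravel(ic))
```

The inverse arrays are flattened with np.ravel so that the sketch behaves the same whether np.unique returns a flat or an input-shaped inverse.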
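Now let's run this function on the numpy.unique() results from the sub-arrays, and then compare the merged indices and unique values against the reference results uc and ic: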
>>> uc2, ic2 = merge_unique(ua, ia, ub, ib)
>>> np.all(uc2 == uc)
True
>>> np.all(ic2 == ic)
True
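Splitting into more than two sub-arrays can be handled with very little extra work: just keep accumulating the "unique" values and indices, like this: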
uacc, iacc = np.unique(subarr1, return_inverse=True)
ui, ii = np.unique(subarr2, return_inverse=True)
uacc, iacc = merge_unique(uacc, iacc, ui, ii)
ui, ii = np.unique(subarr3, return_inverse=True)
uacc, iacc = merge_unique(uacc, iacc, ui, ii)
ui, ii = np.unique(subarr4, return_inverse=True)
uacc, iacc = merge_unique(uacc, iacc, ui, ii)
................................ (etc.)
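If the accumulated index array is itself too large to keep in memory, one possible variation (an assumption on my part, not something the approach above does) is to accumulate only the unique values in a first pass and map each sub-array to dense IDs in a second pass. The helper names `collect_unique` and `to_dense` are hypothetical, and in practice the sub-arrays would be loaded one at a time rather than held in a list.

```python
import numpy as np

def collect_unique(subarrays):
    """Pass 1: accumulate only the sorted unique values of all sub-arrays."""
    uacc = None
    for sub in subarrays:
        ui = np.unique(sub)
        uacc = ui if uacc is None else np.union1d(uacc, ui)
    return uacc

def to_dense(sub, uacc):
    """Pass 2: the dense ID of a value is its position in the sorted uniques."""
    return np.searchsorted(uacc, sub)

a = np.array([[1, 1, 2, 3, 4], [1, 2, 6, 6, 2], [8, 0, 1, 1, 4]])
b = np.array([[11, 2, -1, 12, 6], [12, 2, 6, 11, 2], [7, 0, 3, 1, 3]])

uacc = collect_unique([a, b])
dense_a = to_dense(a, uacc)  # can be written out before loading the next chunk
dense_b = to_dense(b, uacc)

# same result as running np.unique on the stacked array
uc, ic = np.unique(np.vstack([a, b]), return_inverse=True)
assert np.array_equal(uacc, uc)
assert np.array_equal(
    np.concatenate([dense_a.ravel(), dense_b.ravel()]), np.ravel(ic))
```

The trade-off is that every sub-array is read twice, in exchange for never holding more than one chunk of indices at a time.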