
For each row of a 2d numpy array get the index of an equal row in a second 2d array

I have two huge 2d numpy integer arrays X and U, where U is assumed to have only unique rows. For each row in X, I would like to get the row index of the matching row in U (or -1 if there is none). E.g., if the following arrays are passed as inputs:

U = array([[1, 4],
       [2, 5],
       [3, 6]])

X = array([[1, 4],
       [3, 6],
       [7, 8],
       [1, 4]])

the output should be:

array([0,2,-1,0])

Is there an efficient way of doing this (or something similar) with NumPy?

@Divakar: Your approach fails for me:

print(type(rows), rows.dtype, rows.shape)
print(rows[:10])
print(search2D_indices(rows[:10], rows[:10]))

<class 'numpy.ndarray'> int32 (47398019, 5)
[[65536     1     1     1    17]
 [65536     1     1     1   153]
 [65536     1     1     2   137]
 [65536     1     1     3   153]
 [65536     1     1     9   124]
 [65536     1     1    13   377]
 [65536     1     1    13   134]
 [65536     1     1    13   137]
 [65536     1     1    13   153]
 [65536     1     1    13   439]]
[ 0  1  2  3  4 -1 -1 -1 -1  9]

Approach #1

Inspired by this solution to "Find the row indexes of several values in a numpy array", here's a vectorized solution using searchsorted -

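import numpy as np

def search2D_indices(X, searched_values, fillval=-1):
    # Collapse each row to one integer key on a grid that fits both arrays
    # (ravel_multi_index requires non-negative ints)
    dims = np.maximum(X.max(0), searched_values.max(0))+1
    X1D = np.ravel_multi_index(X.T,dims)
    searched_valuesID = np.ravel_multi_index(searched_values.T,dims)
    # Binary-search the query keys among the sorted keys of X
    sidx = X1D.argsort()
    idx = np.searchsorted(X1D,searched_valuesID,sorter=sidx)
    idx[idx==len(sidx)] = 0    # clamp positions past the end
    idx_out = sidx[idx]
    # Keep a position only when the key found there really matches
    return np.where(X1D[idx_out] == searched_valuesID, idx_out, fillval)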

Sample run -

In [121]: U
Out[121]: 
array([[1, 4],
       [2, 5],
       [3, 6]])

In [122]: X
Out[122]: 
array([[1, 4],
       [3, 6],
       [7, 8],
       [1, 4]])

In [123]: search2D_indices(U, X, fillval=-1)
Out[123]: array([ 0,  2, -1,  0])
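To make the key step of Approach #1 concrete, here is a small sketch (not part of the original answer) that prints the integer keys `np.ravel_multi_index` produces for the sample arrays. The names `U_keys` and `X_keys` are illustrative only.

```python
import numpy as np

U = np.array([[1, 4],
              [2, 5],
              [3, 6]])
X = np.array([[1, 4],
              [3, 6],
              [7, 8],
              [1, 4]])

# Grid shape large enough to hold every value in either array: [8, 9]
dims = np.maximum(U.max(0), X.max(0)) + 1

# Each row (r, c) becomes the single integer r * dims[1] + c
U_keys = np.ravel_multi_index(U.T, dims)
X_keys = np.ravel_multi_index(X.T, dims)

print(U_keys)   # [13 23 33]
print(X_keys)   # [13 33 71 13]
```

Equal rows get equal keys, so the 2d row matching reduces to a 1d lookup that `searchsorted` can handle; the key 71 has no counterpart in `U_keys`, which is why the third result is -1.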

Approach #2

To extend this to cases with negative ints, we need to offset dims and the conversion to 1D accordingly, like so -

def search2D_indices_v2(X, searched_values, fillval=-1):
    # Per-column bounds over both arrays, so negative values are handled
    lo = np.minimum(X.min(0), searched_values.min(0))
    hi = np.maximum(X.max(0), searched_values.max(0))

    dims = (hi - lo + 1).astype(np.int64)
    # Mixed-radix weights 1, d0, d0*d1, ... give every distinct row its own key
    s = np.concatenate(([1], dims[:-1].cumprod()))

    X1D = (X - lo).dot(s)
    searched_valuesID = (searched_values - lo).dot(s)
    sidx = X1D.argsort()
    idx = np.searchsorted(X1D,searched_valuesID,sorter=sidx)
    idx[idx==len(sidx)] = 0    
    idx_out = sidx[idx]

    return np.where(X1D[idx_out] == searched_valuesID, idx_out, fillval)

Sample run -

In [142]: U
Out[142]: 
array([[-1, -4],
       [ 2,  5],
       [ 3,  6]])

In [143]: X
Out[143]: 
array([[-1, -4],
       [ 3,  6],
       [ 7,  8],
       [-1, -4]])

In [144]: search2D_indices_v2(U, X, fillval=-1)
Out[144]: array([ 0,  2, -1,  0])
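Both approaches above fold a whole row into one integer, so the product of the per-column value ranges has to fit in the integer type used for the keys. With many columns or wide value ranges that is not guaranteed. The following is a minimal sketch of a pre-check; `key_space_fits_int64` is a hypothetical helper, not part of the answer, and it assumes int64 keys.

```python
import numpy as np

def key_space_fits_int64(*arrays):
    """Return True if the product of per-column value ranges fits in int64.

    Hypothetical helper for illustration; assumes all arrays have the
    same number of columns and integer dtype.
    """
    lo = np.min([a.min(0) for a in arrays], axis=0)
    hi = np.max([a.max(0) for a in arrays], axis=0)
    total = 1
    for l, h in zip(lo.tolist(), hi.tolist()):
        total *= (h - l + 1)          # Python ints do not overflow
    return total <= np.iinfo(np.int64).max

small = np.array([[-1, -4], [3, 6]])
wide = np.array([[0] * 5, [2**31 - 1] * 5], dtype=np.int32)

print(key_space_fits_int64(small))   # True
print(key_space_fits_int64(wide))    # False
```

When the check fails, a method that does not build integer keys (such as the views or dictionary approaches in this thread) avoids the problem.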

Approach #3

Another approach, based on views -

# https://stackoverflow.com/a/45313353/ @Divakar
def view1D(a, b): # a, b are arrays
    a = np.ascontiguousarray(a)
    b = np.ascontiguousarray(b)
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    return a.view(void_dt).ravel(),  b.view(void_dt).ravel()

def search2D_indices_views(X, searched_values, fillval=-1):
    X1D,searched_valuesID = view1D(X, searched_values)
    sidx = X1D.argsort()
    idx = np.searchsorted(X1D,searched_valuesID,sorter=sidx)
    idx[idx==len(sidx)] = 0    
    idx_out = sidx[idx]
    return np.where(X1D[idx_out] == searched_valuesID, idx_out, fillval)

This is a dictionary-based method:

import numpy as np

U = np.array([[1, 4],
              [2, 5],
              [3, 6]])

X = np.array([[1, 4],
              [3, 6],
              [7, 8],
              [1, 1]])

d = {v: k for k, v in enumerate(map(tuple, U))}

res = np.array([d.get(tuple(a), -1) for a in X])

# [ 0  2 -1 -1]
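As a sketch, the same dictionary lookup can be wrapped in a function, which is handy as a slow but simple reference when validating a vectorized version on a small slice of the data. The name `lookup_rows_dict` and the `fillval` parameter are assumptions made for this illustration.

```python
import numpy as np

def lookup_rows_dict(U, X, fillval=-1):
    """Index in U of each row of X, or fillval when the row is absent."""
    # Map each row of U (as a hashable tuple) to its position
    row_to_index = {tuple(row): i for i, row in enumerate(U.tolist())}
    return np.array([row_to_index.get(tuple(row), fillval)
                     for row in X.tolist()])

U = np.array([[1, 4],
              [2, 5],
              [3, 6]])
X = np.array([[1, 4],
              [3, 6],
              [7, 8],
              [1, 4]])

print(lookup_rows_dict(U, X))   # [ 0  2 -1  0]
```

Here X is the array from the question, so the last row matches row 0 of U.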

You can use broadcasting to test the items for equality in a vectorized manner. Afterward, apply the all function over the proper axis to get the truth values corresponding to the expected indices. Finally, use np.where to get the indices where equality holds and assign them to a previously created array filled with -1.

In [47]: result = np.full(X.shape[0], -1)

In [48]: x, y = np.where((X[:,None] == U).all(-1))

In [49]: result[x] = y

In [50]: result
Out[50]: array([ 0,  2, -1,  0])

Note that, as the documentation also mentions regarding broadcasting:

while this is very efficient in terms of lines of code, it may or may not be computationally efficient. The issue is the three-dimensional diff array calculated in an intermediate step of the algorithm. For small data sets, creating and operating on the array is likely to be very fast. However, large data sets will generate a large intermediate array that is computationally inefficient.
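One way to keep the intermediate array bounded is to broadcast over X in chunks. The following is a minimal sketch under that assumption; `match_rows_chunked` and its `chunk` parameter are hypothetical names, and the total work is still proportional to len(X) * len(U), so this limits memory rather than run time.

```python
import numpy as np

def match_rows_chunked(X, U, chunk=1024, fillval=-1):
    """Broadcasted row matching, processing X in blocks of `chunk` rows."""
    result = np.full(X.shape[0], fillval)
    for start in range(0, X.shape[0], chunk):
        block = X[start:start + chunk]
        # The 3-D temporary holds at most chunk * len(U) * ncols booleans
        eq = (block[:, None] == U).all(-1)
        x, y = np.where(eq)
        result[start + x] = y
    return result

U = np.array([[1, 4],
              [2, 5],
              [3, 6]])
X = np.array([[1, 4],
              [3, 6],
              [7, 8],
              [1, 4]])

print(match_rows_chunked(X, U, chunk=2))   # [ 0  2 -1  0]
```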
