
Selecting rows of a numpy array that contain all the given values

I have a np.array:

matrix = np.array([['A', 'B', 'C'], ['A', 'B', np.nan], ['C', np.nan, np.nan] ])

and I want to efficiently select all the rows that contain all of the given values:

samples = ['C', 'A']

but when I run:

mask = np.isin(matrix, samples)

I get:

array([[ True, False,  True],
       [ True, False, False],
       [ True, False, False]])

How can I efficiently get a mask that is True only for the rows that contain both values?

I focus on efficiency because it is a big, sparse matrix.

Thank you in advance.

My first approach would be:

[np.isin(samples, row).all() for row in matrix]
# [True, False, False]

(But to be honest, I can't say anything about its efficiency or performance...)

If you want something vectorized, I would suggest doing this comparison by adding a third dimension and broadcasting over it, so that every entry of the matrix is compared against every sample. Then, for each sample, reduce over the column axis to check whether the row contains that sample anywhere. Finally, a row qualifies only if that check is True for every sample, and that is the result we should return.

In [40]: matrix = np.array([['A', 'B', 'C'], ['A', 'B', np.nan], ['C', np.nan, np.nan] ])

In [41]: samples = ['C', 'A']

In [42]: samples = np.array(samples)

In [43]: mask = matrix[...,None] == samples[None,None]

In [44]: mask
Out[44]:
array([[[False,  True],
        [False, False],
        [ True, False]],

       [[False,  True],
        [False, False],
        [False, False]],

       [[ True, False],
        [False, False],
        [False, False]]])

In [45]: mask = np.any(mask, axis=1)

In [46]: mask
Out[46]:
array([[ True,  True],
       [False,  True],
       [ True, False]])

In [47]: mask = np.all(mask, axis=1)

In [48]: mask
Out[48]: array([ True, False, False])
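To see why the comparison in In [43] yields a 3D array, it may help to look at the shapes involved. The following is just an illustrative sketch; the names `lhs`, `rhs` and `cube` are made up here and are not part of the session above.

```python
import numpy as np

matrix = np.array([['A', 'B', 'C'], ['A', 'B', np.nan], ['C', np.nan, np.nan]])
samples = np.array(['C', 'A'])

lhs = matrix[..., None]    # shape (3, 3, 1): rows, columns, one trailing axis
rhs = samples[None, None]  # shape (1, 1, 2): two leading axes, then the samples
cube = lhs == rhs          # broadcasts to (3, 3, 2): rows, columns, samples

print(lhs.shape, rhs.shape, cube.shape)  # (3, 3, 1) (1, 1, 2) (3, 3, 2)
```

Reducing `cube` with `np.any(..., axis=1)` removes the column axis, and the following `np.all(..., axis=1)` removes the sample axis, leaving one boolean per row.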

To do this more concisely:

# Define data
matrix = np.array([['A', 'B', 'C'], ['A', 'B', np.nan], ['C', np.nan, np.nan] ])
samples = ['C', 'A']

# Solution
mask = np.all(np.any(matrix[...,None] == np.array(samples)[None,None], axis=1), axis=1)

Take note that this will probably not do well with large sparse matrices, since the intermediate 3D boolean array has one entry per (row, column, sample) combination.
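If memory is the concern, one possible variation (a minimal sketch, not benchmarked) is to loop over the samples and combine one 2D comparison at a time, so the 3D intermediate is never built. It assumes the same `matrix` and `samples` as above.

```python
import numpy as np

matrix = np.array([['A', 'B', 'C'], ['A', 'B', np.nan], ['C', np.nan, np.nan]])
samples = ['C', 'A']

# Start with every row selected, then rule rows out one sample at a time.
mask = np.ones(matrix.shape[0], dtype=bool)
for s in samples:
    # (matrix == s) is a 2D boolean array; any(axis=1) marks rows containing s.
    mask &= (matrix == s).any(axis=1)

print(mask)  # [ True False False]
```

The loop runs once per sample, so this trades a Python-level loop over `samples` for a smaller peak memory footprint.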

Here is some pseudocode that might help you:

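idxRows = []
# mask is the 2D boolean array returned by np.isin(matrix, samples)
for idx, row in enumerate(mask):
    if row.any():  # True if the row matches at least one sample
        idxRows.append(idx)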

This will give you the indices of all the rows that contain at least one of the samples; to require every sample, the per-row test would need to be stricter.

I finally used:

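# Filter
test_elements = ['A', 'B']
# True wherever an entry of matrix is one of test_elements
mask = np.isin(matrix, test_elements)
# Keep rows whose number of matches equals the number of test elements
vec_mask = mask.sum(axis=1) == len(test_elements)
ids = np.where(vec_mask)
existence = matrix[ids]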

Thank you for your help, guys.
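One caveat with the counting approach above: it appears to assume that each value occurs at most once per row. The sketch below uses a made-up array `dup` (not from the question) to show how a repeated value can produce a false positive, and how checking each element separately avoids it.

```python
import numpy as np

# Hypothetical data: the first row repeats 'A' and has no 'B'.
dup = np.array([['A', 'A', 'C'], ['A', 'B', 'C']])
test_elements = ['A', 'B']

# Counting matches: the first row has two hits, so it is wrongly selected.
count_mask = np.isin(dup, test_elements).sum(axis=1) == len(test_elements)
print(count_mask)  # [ True  True]

# Checking each element separately is robust to repeated values.
per_elem = np.all([(dup == e).any(axis=1) for e in test_elements], axis=0)
print(per_elem)  # [False  True]
```

If rows never contain duplicates, both masks agree and the counting version is fine.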
