
Numpy find multiple strings in 2d array

I'm new to Numpy and it's been a while since I last wrote Python.

I'm struggling to find multiple strings in a Numpy array which was sliced.
My data:

string0 = "part0-part1-part2-part3-part4"
string1 = "part5-part6-part9-part7-part8"
string2 = "part5-part6-part1-part8-part7"

Each string is split into its parts, and the parts are combined with the raw strings into one array so everything is in one place.

import numpy as np

stringsraw = np.array([[string0], [string1], [string2]])  # shape (3, 1)
stringssliced = np.array(np.char.split(stringsraw, sep='-').tolist())  # shape (3, 1, 5)
stringscombined = np.squeeze(np.dstack((stringsraw, stringssliced)))  # shape (3, 6)

This results in:

[['part0-part1-part2-part3-part4' 'part0' 'part1' 'part2' 'part3' 'part4']
 ['part5-part6-part9-part7-part8' 'part5' 'part6' 'part9' 'part7' 'part8']
 ['part5-part6-part1-part8-part7' 'part5' 'part6' 'part1' 'part8' 'part7']]

I want to find the indices of 'part1' and 'part7':

np.where((stringscombined[2] == "part1") & (stringscombined[2] == "part7"))

The result is empty. Can anyone explain why the result is not [3,5]?

I thought there would be a nicer way than a for loop through everything.

The "wished" query/result would be:

np.where((stringscombined == "part6") & (stringscombined == "part7")) 
= array[[1,2,4]
        [2,2,5]]

Any help appreciated.

We can first detect where the two elements are, using np.isin:

np.isin(stringscombined,["part1","part7"])
array([[False, False,  True, False, False, False],
       [False, False, False, False,  True, False],
       [False, False, False,  True, False,  True]])

Using np.where() on this will tell us where the elements can be found. We need one more piece of information: which rows contain both "part1" and "part7":

(np.sum(stringscombined=="part1",axis=1)>0) & (np.sum(stringscombined=="part7",axis=1)>0)

array([False, False,  True])

The above tells us to take indices only from the row with index 2 (the third row). Combining these two pieces of information into a function:

def index_A(Array,i1,i2):
    idx = (np.sum(Array==i1,axis=1)>0) & (np.sum(Array==i2,axis=1)>0)
    loc = np.where(np.isin(Array,[i1,i2]))
    hits = [np.insert(loc[1][loc[0]==i],0,i) for i in np.where(idx)[0]]
    return hits

index_A(stringscombined,"part6","part7")
[array([1, 2, 4]), array([2, 2, 5])]
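
The same idea can be extended to any number of search strings. The following is only a sketch under assumptions: the helper name `rows_containing_all` is hypothetical (it is not part of the answer above or of NumPy), and `targets` is assumed to be a non-empty list. It rebuilds the 6-column `stringscombined` array with a list comprehension so that the block runs on its own.

```python
import numpy as np

def rows_containing_all(arr, targets):
    """Hypothetical helper: for each row that contains every target,
    return [row, col, col, ...] with the columns that hold any target."""
    arr = np.asarray(arr)
    # One boolean column per target: does the row contain it anywhere?
    has_each = np.stack([(arr == t).any(axis=1) for t in targets], axis=1)
    keep = has_each.all(axis=1)        # rows that contain all targets
    mask = np.isin(arr, targets)       # positions of any target
    return [np.insert(np.flatnonzero(mask[r]), 0, r)
            for r in np.flatnonzero(keep)]

strings = ["part0-part1-part2-part3-part4",
           "part5-part6-part9-part7-part8",
           "part5-part6-part1-part8-part7"]
# Column 0 holds the raw string, columns 1..5 hold its parts.
stringscombined = np.array([[s] + s.split("-") for s in strings])

hits = rows_containing_all(stringscombined, ["part6", "part7"])
print(hits)  # [array([1, 2, 4]), array([2, 2, 5])]
```

Using `.any(axis=1)` instead of `np.sum(...) > 0` is a stylistic choice; both give the same row filter.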

We can simplify the dimensions a bit with:

In [475]: stringsraw = np.array([string0, string1, string2])                             
In [476]: stringsraw                                                                     
Out[476]: 
array(['part0-part1-part2-part3-part4', 'part5-part6-part9-part7-part8',
       'part5-part6-part1-part8-part7'], dtype='<U29')
In [477]: np.char.split(stringsraw, sep='-')                                             
Out[477]: 
array([list(['part0', 'part1', 'part2', 'part3', 'part4']),
       list(['part5', 'part6', 'part9', 'part7', 'part8']),
       list(['part5', 'part6', 'part1', 'part8', 'part7'])], dtype=object)
In [478]: np.stack(_)                                                                    
Out[478]: 
array([['part0', 'part1', 'part2', 'part3', 'part4'],
       ['part5', 'part6', 'part9', 'part7', 'part8'],
       ['part5', 'part6', 'part1', 'part8', 'part7']], dtype='<U5')
In [479]: arr = _                        

A list comprehension would be just as good (and fast):

In [491]: [s.split('-') for s in [string0, string1, string2]]
Out[491]: 
[['part0', 'part1', 'part2', 'part3', 'part4'],
 ['part5', 'part6', 'part9', 'part7', 'part8'],
 ['part5', 'part6', 'part1', 'part8', 'part7']]
In [492]: np.array(_)                                                                    
Out[492]: 
array([['part0', 'part1', 'part2', 'part3', 'part4'],
       ['part5', 'part6', 'part9', 'part7', 'part8'],
       ['part5', 'part6', 'part1', 'part8', 'part7']], dtype='<U5')
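
As a quick check that the two routes above build the same array, here is a small sketch. It assumes every string splits into the same number of parts, as in the question's data; with uneven rows neither route would produce a regular 2d array.

```python
import numpy as np

strings = ["part0-part1-part2-part3-part4",
           "part5-part6-part9-part7-part8",
           "part5-part6-part1-part8-part7"]

# Route 1: np.char.split gives an object array of lists, np.stack joins them.
via_char_split = np.stack(np.char.split(np.array(strings), sep="-"))

# Route 2: plain Python split inside a list comprehension.
via_comprehension = np.array([s.split("-") for s in strings])

print(via_char_split.shape)                              # (3, 5)
print(np.array_equal(via_char_split, via_comprehension))  # True
```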

And then do equality tests on slices or on the whole array:

In [488]: np.nonzero((arr[2]=='part1')|(arr[2]=='part7'))                                
Out[488]: (array([2, 4]),)
In [489]: arr=='part1'                                                                   
Out[489]: 
array([[False,  True, False, False, False],
       [False, False, False, False, False],
       [False, False,  True, False, False]])
In [490]: np.nonzero(_)                                                                  
Out[490]: (array([0, 2]), array([1, 2]))

In [493]: np.in1d(arr[2],['part1','part7'])                                              
Out[493]: array([False, False,  True, False,  True])

There's nothing special about numpy's handling of strings.
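
This also explains the empty result in the question: `&` is applied element by element, and no single element can equal both 'part1' and 'part7', so the combined mask is all False. A minimal sketch, using the third row of the 5-column `arr` written out by hand:

```python
import numpy as np

row = np.array(["part5", "part6", "part1", "part8", "part7"])

# Element-wise AND: True only where one element equals both strings (never).
both = (row == "part1") & (row == "part7")
# Element-wise OR: True where an element equals either string.
either = (row == "part1") | (row == "part7")

print(both.any())           # False
print(np.where(either)[0])  # [2 4]
```

In the question's 6-column `stringscombined`, where column 0 holds the raw string, the same test on row 2 would give positions 3 and 5.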

np.isin also works. It uses in1d internally. If one argument is small, it actually does the repeated | as in [488]:

In [501]: np.isin(arr,['part1','part7'])                                                 
Out[501]: 
array([[False,  True, False, False, False],
       [False, False, False,  True, False],
       [False, False,  True, False,  True]])
In [502]: np.nonzero(_)                                                                  
Out[502]: (array([0, 1, 2, 2]), array([1, 3, 2, 4]))
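
If (row, column) pairs are more convenient than the tuple of index arrays returned by np.nonzero, np.argwhere gives them directly. The sketch below recreates `arr` by hand; the `by_row` grouping is just one illustrative way to collect the columns per row, not something from the answer above.

```python
import numpy as np

arr = np.array([["part0", "part1", "part2", "part3", "part4"],
                ["part5", "part6", "part9", "part7", "part8"],
                ["part5", "part6", "part1", "part8", "part7"]])

mask = np.isin(arr, ["part1", "part7"])

# One [row, col] pair per match, in row-major order.
pairs = np.argwhere(mask)
print(pairs.tolist())  # [[0, 1], [1, 3], [2, 2], [2, 4]]

# Group the matching columns by row.
by_row = {int(r): np.flatnonzero(mask[r]).tolist()
          for r in np.unique(pairs[:, 0])}
print(by_row)  # {0: [1], 1: [3], 2: [2, 4]}
```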


 