Extracting values from a numpy array using a string condition

Question

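This is my first post here, and English is not my native language, so I will try to be as clear as I can.

I have a numpy array that comes from a shape (basically a data table) and contains the following: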

[('information1',   'identifier1',              length1)
('information2',    'identifier2',              length2)
('information3',    'identifier3,identifier4',  length3)
....
]

Where:

informationx is a string,
identifier is a string containing one or more ids,
length is a float.

I need to extract from this array all the rows that contain information about a given identifier.

In SQL I would do

select * from array where id like "%identifier1%"

When a row holds only one identifier, this is easy:

extract = array[array[id_header] == identifier1]

Is there an elegant, Pythonic way to do this when a row can hold several identifiers (maybe with extract, select, or where)?

Answer 1

This is a simple task in pandas. If you can use pandas, convert the array to a pandas DataFrame:

import pandas as pd
df = pd.DataFrame(your_array)  # create the data frame
df.columns = ['col_1', 'col_2', 'col_3']  # set the column names

This assumes you have named your columns col_1, col_2, col_3.

Use this code to select the rows whose identifier column contains the value you want.

df_subset = df[ df['col_2'].str.contains('identifierx') ] #subselecting the data frame.

If you can't use pandas and can only use numpy:

new_lis = []
for idx in range(0,len(your_array)):
    if( 'identifierx' in your_array[idx][1]):
        new_lis.append(your_array[idx])
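
The loop above can also be written as a boolean mask, which is closer to the extract/select/where style the question asks about. This is only a minimal sketch: the field names (info, id, length), the sample values, and the helper rows_containing are assumptions made for illustration, and it assumes the table is a structured array with a string-typed identifier field.

```python
import numpy as np

# Hypothetical field names and sample values standing in for the question's table.
table = np.array(
    [
        ("information1", "identifier1", 1.5),
        ("information2", "identifier2", 2.0),
        ("information3", "identifier3,identifier4", 3.25),
    ],
    dtype=[("info", "U32"), ("id", "U64"), ("length", "f8")],
)


def rows_containing(arr, field, needle):
    """Return the rows whose `field` contains `needle` as a substring."""
    # np.char.find returns -1 wherever the substring is absent.
    mask = np.char.find(arr[field], needle) >= 0
    return arr[mask]


extract = rows_containing(table, "id", "identifier4")
print(extract)
```

Like the SQL LIKE "%identifier1%" query, this is a plain substring match, so searching for identifier1 would also match a row holding identifier10.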

Answer 2

You can loop over each row and check whether the identifier is the one you want:

lengths = []
for i in range(array.shape[0]):  # iterate through each row in the table
    if "identifierx" in array[i][1]:  # substring test, so rows holding several ids also match
        lengths.append(array[i][2])  # collect the lengths for the identifier you want
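
If a substring test is too loose (identifier1 would also match identifier10), one option is to split each cell on commas and compare whole ids. The sketch below assumes the ids are comma-separated, as in the question's third row; the field names, the sample values, and the helper rows_with_id are hypothetical.

```python
import numpy as np

# Hypothetical sample data; the field names are assumptions, not from the question.
table = np.array(
    [
        ("information1", "identifier1", 1.5),
        ("information2", "identifier10", 2.0),
        ("information3", "identifier3,identifier1", 3.25),
    ],
    dtype=[("info", "U32"), ("id", "U64"), ("length", "f8")],
)


def rows_with_id(arr, field, wanted):
    """Return rows where `wanted` is one of the comma-separated ids in `field`."""
    mask = np.array(
        [wanted in (part.strip() for part in cell.split(",")) for cell in arr[field]],
        dtype=bool,
    )
    return arr[mask]


lengths = rows_with_id(table, "id", "identifier1")["length"]
print(lengths)
```

The mask is still built with a Python-level loop, so this trades some speed for exact matching.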

Answer 3

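That's a nice numpy solution! I just want to add a list comprehension version.

I ran these on a (1000012, 3) array filled with the values above, searched for one identifier, and got the following timings: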

%%time
new_lis = []
for idx in range(0,len(huge_data)):
    if('identifier3' in huge_data[idx][1]):
        new_lis.append(huge_data[idx])

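Returns wall time: 875 ms

For the list comprehension:

# Note: this collects the matching indices; use huge_data[idx] as the element to collect the rows.
new_lis = [idx for idx in range(0, len(huge_data)) if ('identifier3' in huge_data[idx][1])]

Returns wall time: 772 ms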

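But yeah, I tried to solve it with a list comp + numpy indexing, and because I used a regular expression to catch the strings, it slowed down to ~4.5 s. Wah waaaah.

Good question, good answers!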
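
To reproduce this kind of comparison on your own machine, something like the following sketch could be used. The data here is a hypothetical stand-in for huge_data (three sample rows repeated), the function names are made up for illustration, and no timings are claimed since they depend on hardware and numpy version.

```python
import re
import timeit

import numpy as np

# Hypothetical stand-in for huge_data: three sample rows repeated many times.
base = [
    ("information1", "identifier1", 1.5),
    ("information2", "identifier2", 2.0),
    ("information3", "identifier3,identifier4", 3.25),
]
huge_data = np.array(
    base * 10_000,
    dtype=[("info", "U32"), ("id", "U64"), ("length", "f8")],
)
ids = huge_data["id"]


def with_in():
    # Plain substring test inside a list comprehension.
    return [i for i, cell in enumerate(ids) if "identifier3" in cell]


pattern = re.compile(r"identifier3")


def with_regex():
    # Same search through a precompiled regular expression.
    return [i for i, cell in enumerate(ids) if pattern.search(cell)]


def with_char_find():
    # Vectorized substring search; -1 means "not found".
    return np.flatnonzero(np.char.find(ids, "identifier3") >= 0).tolist()


# All three approaches should select the same row indices.
assert with_in() == with_regex() == with_char_find()

for fn in (with_in, with_regex, with_char_find):
    seconds = timeit.timeit(fn, number=3)
    print(f"{fn.__name__}: {seconds:.3f} s for 3 runs")
```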
