
Compare 2-dimensional numpy array with 1-dimensional numpy array

I have a numpy array a of shape (m1, m2) with string entries. I compare the entries of this array a with a one-dimensional numpy array of strings (arr). The one-dimensional array arr is of shape (n,), where n is a big number (~10,000).

An example of the array a can be found here (file.txt). An example of the array arr can be found here (arr.txt).

This is how I compare arr to rows in a. If an element of arr is found in any row of a, then I save the index of that element from arr in a new list (comp + str(i).zfill(5)):

import pandas as pd
import numpy as np

# read the 2-D string array; drop the first column and keep the rest as strings
a = pd.read_csv('file.txt', error_bad_lines=False, sep=r'\s+', header=None).values[:, 1:].astype('<U1000')

# read the 1-D array of strings to compare against
arr = np.genfromtxt('arr.txt', dtype='str')

for i in range(a.shape[0]):
    # one result list per row of a, named comp00000, comp00001, ...
    globals()['comp' + str(i).zfill(5)] = []
    for j in range(len(arr)):
        # save the index j if arr[j] occurs anywhere in row i of a
        if arr[j] in set(a[i, :]):
            globals()['comp' + str(i).zfill(5)] += [j]

But the code that I have above is really slow (it takes ~15-20 mins). I am wondering if there is a faster way to achieve this. Any suggestion will be appreciated.

I couldn't reliably read your file.txt so I used a small subset of it. I converted 'arr' into a dictionary, called 'lu', with the text as the keys and the index positions as the values.

In [132]: a=np.array([['onecut2', 'ttc14', 'zadh2', 'pygm', 'tiparp', 'mgat4a', 'man2a1', 'zswim5', 'tubd1', 'igf2bp3'],
 ...: ['pou2af1', 'slc25a12', 'zbtb25', 'unk', 'aif1', 'tmem54', 'apaf1', 'dok2', 'fam60a', 'rab4b'],
 ...: ['rara', 'kcnk4', 'gfer', 'trip10', 'cog6', 'srebf1', 'zgpat', 'rxrb', 'clcf1', 'fyttd1'],
 ...: ['rarb', 'kcnk4', 'gfer', 'trip10', 'cog6', 'srebf1', 'zgpat', 'rxrb', 'clcf1', 'fyttd1'],
 ...: ['rarg', 'kcnk4', 'gfer', 'trip10', 'cog6', 'srebf1', 'zgpat', 'rxrb', 'clcf1', 'fyttd1'],
 ...: ['pou5f1', 'slc25a12', 'zbtb25', 'unk', 'aif1', 'tmem54', 'apaf1', 'dok2', 'fam60a', 'rab4b'],
 ...: ['apc', 'rab34', 'lsm3', 'calm2', 'rbl1', 'gapdh', 'prkce', 'rrm1', 'irf4', 'actr1b']])

In [133]: def do_analysis(src, lu):
 ...:     res={}  # Initialise result to an empty dictionary
 ...:     for r, row in enumerate(src):
 ...:         temp_list=[]   # list to append results to in the inner loop
 ...:         for txt in row:
 ...:             exists=lu.get(txt, -1)   # lu returns the index of txt in arr, or -1 if not found.
 ...:             if exists>=0: temp_list.append(exists)   # If txt was found in arr, append its index to temp_list
 ...:         res['comp'+str(r).zfill(5)]=temp_list  
 ...:         # Once analysis of the row has finished store the list in the res dictionary
 ...:     return res

In [134]: lu=dict(zip(arr, range(len(arr)))) 
          # Turn the array 'arr' into a dictionary which returns the index of the corresponding text.

In [135]: lu
Out[135]: 
{'pycrl': 0, 'gpr180': 1, 'gpr182': 2, 'gpr183': 3, 'neurl2': 4,
 ...
 'hcn2': 999,   ...}

In [136]: do_analysis(a, lu)
Out[136]: 
{'comp00000': [6555, 3682, 7282, 1868, 5522, 9128, 1674, 8695, 156],
 'comp00001': [6006, 3846, 8185, 8713, 5806, 4912, 597, 7565, 3003],
 'comp00002': [9355, 3717, 1654, 2386, 6309, 7396, 3825, 2135, 6596, 7256],
 'comp00003': [9356, 3717, 1654, 2386, 6309, 7396, 3825, 2135, 6596, 7256],
 'comp00004': [9358, 3717, 1654, 2386, 6309, 7396, 3825, 2135, 6596, 7256],
 'comp00005': [6006, 3846, 8185, 8713, 5806, 4912, 597, 7565, 3003],
 'comp00006': [8916, 8588, 2419, 3656, 9015, 7045, 7628, 5519, 8793, 1946]}

In [137]: %timeit do_analysis(a, lu)
47.9 µs ± 80.8 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Scaling from my 462-byte array a (analysed in 48 µs) up to your 262000-byte file gives a rough estimate of the full run time:

In [138]: 262000 / 462 * 48 / 1000000
Out[138]: 0.0272 seconds

If the array 'a' is a list of lists, the analysis runs twice as fast as when 'a' is a numpy array.
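A minimal way to try that, reusing the do_analysis function and the lu dictionary from above (a.tolist() converts the numpy string array into plain Python lists):

a_list = a.tolist()              # list of lists of Python str instead of a numpy array
res = do_analysis(a_list, lu)    # same results as In [136], iterating over plain lists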

I hope this does what you need or points you in the right direction.
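For reference, here is a minimal end-to-end sketch of applying this lookup approach to the files from the question (the read_csv call is copied from the question; the membership test txt in lu replaces the get()/-1 check above and produces the same lists):

import pandas as pd
import numpy as np

# load the data exactly as in the question
a = pd.read_csv('file.txt', error_bad_lines=False, sep=r'\s+', header=None).values[:, 1:].astype('<U1000')
arr = np.genfromtxt('arr.txt', dtype='str')

# build the lookup once: text -> index position in arr
lu = dict(zip(arr, range(len(arr))))

# one pass over the rows of a, collecting the arr-indices of the entries that appear in arr
res = {}
for r, row in enumerate(a.tolist()):   # tolist(): plain lists iterate faster than numpy rows
    res['comp' + str(r).zfill(5)] = [lu[txt] for txt in row if txt in lu]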
