
Fastest way to find matching indices between two lists in Python?

I have two lists:

listA = ['123', '345', '678']
listB = ['ABC123', 'CDE455', 'GHK678', 'CGH345']

I want to find the position in listB of the element that contains each element of listA. For example, the expected output is

0 3 2

Here 123 appears in the first element of listB, so the result is 0; 345 appears in the fourth position of listB, so the result is 3. Note that both lists are very large (about 500K elements), so the nested for loop is too slow. Can you suggest a faster solution? This is my current solution:

for i in range(len(listA)):
    for j in range(len(listB)):
        # Substring test: is the listA item contained in the listB item?
        if listA[i] in listB[j]:
            print('Position', j)

You can try it like this. Dictionary lookups take constant time on average, so the solution builds a dictionary that maps the number in each listB entry to that entry's index, then looks up each listA item in it.

In [1]: import re

In [2]: listA = ['123', '345', '678']

In [3]: listB = ['ABC123', 'CDE455', 'GHK678', 'CGH345']

In [4]: # Mapping b/w number in listB to related index

In [5]: mapping = {re.sub(r'\D+', '', value).strip(): index for index, value in enumerate(listB)}

In [6]: mapping # Print mapping dictionary
Out[6]: {'123': 0, '455': 1, '678': 2, '345': 3}

In [7]: # Find the desired output

In [8]: output = [mapping.get(item) for item in listA]

In [9]: output
Out[9]: [0, 3, 2]
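
The session above can also be written as a small function. This is only a sketch under the same assumption as the answer, namely that each listA item equals a whole run of digits inside some listB entry. The function name index_by_digits is made up for illustration. Unlike the dict comprehension, which keeps the last index when two listB entries share the same digits, setdefault keeps the first one.

```python
import re

def index_by_digits(list_a, list_b):
    """Return, for each item of list_a, the first index in list_b whose
    digit run equals that item, or None if there is none.

    Assumption: items of list_a are matched against whole digit runs only.
    """
    mapping = {}
    for index, value in enumerate(list_b):
        # A list_b entry may contain several digit runs; index each of them.
        for digits in re.findall(r'\d+', value):
            # setdefault keeps the first index seen for a given digit run.
            mapping.setdefault(digits, index)
    return [mapping.get(item) for item in list_a]

listA = ['123', '345', '678']
listB = ['ABC123', 'CDE455', 'GHK678', 'CGH345']
print(index_by_digits(listA, listB))
```

Building the dictionary is one pass over listB and each lookup is constant time on average, so the total work grows with len(listA) + len(listB) instead of their product.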


It essentially depends on your dataset. If the dataset is large enough that you need low complexity, I'd suggest looking into the Aho-Corasick algorithm. The gist is that you preprocess listA into a trie in which each node carries a failure link to the node for the longest proper suffix of its string that also appears in the trie. You can then iterate over each character of each word in listB, following the trie and its failure links as you go. The total cost is then roughly the preprocessing time of listA plus the scanning time of listB, rather than their product.

As a side note, this doesn't reduce the complexity if listA is dynamic, since the trie has to be rebuilt whenever listA changes.
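
To make the idea concrete, here is a minimal pure-Python sketch of Aho-Corasick applied to this problem. It is an illustration, not a tuned implementation: the function names build_automaton and first_match_positions are hypothetical, patterns are assumed to be non-empty strings, and only the first matching listB index is recorded for each listA item.

```python
from collections import deque

def build_automaton(patterns):
    """Build the trie, failure links and output lists for the patterns."""
    goto = [{}]   # goto[node][char] -> child node
    fail = [0]    # fail[node] -> node of the longest proper suffix in the trie
    out = [[]]    # out[node] -> indices of patterns that end at this node
    for idx, pattern in enumerate(patterns):
        node = 0
        for ch in pattern:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(idx)
    # Breadth-first pass: depth-1 nodes keep fail = 0 (the root).
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # Inherit patterns that end at the suffix node as well.
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out

def first_match_positions(list_a, list_b):
    """For each item of list_a, return the first index j such that the item
    is a substring of list_b[j], or None if no entry contains it."""
    goto, fail, out = build_automaton(list_a)
    result = [None] * len(list_a)
    for j, text in enumerate(list_b):
        node = 0
        for ch in text:
            # Follow failure links until a transition on ch exists.
            while node and ch not in goto[node]:
                node = fail[node]
            node = goto[node].get(ch, 0)
            for idx in out[node]:
                if result[idx] is None:
                    result[idx] = j
    return result

listA = ['123', '345', '678']
listB = ['ABC123', 'CDE455', 'GHK678', 'CGH345']
print(first_match_positions(listA, listB))
```

Each character of listB is processed once, apart from the failure-link hops and the reported matches, which is where the additive rather than multiplicative cost comes from. For real workloads a compiled implementation would likely be much faster than this pure-Python sketch.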

Try adding all the elements of the list to a set() and searching that. A set has a much faster "in" test than a list.
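
A set only tests for exact equality, not for substrings, so one way to apply this suggestion is sketched below. It assumes that every listA item has the same length k (true for the sample data, where k is 3), which lets us look up every length-k slice of each listB entry in the set. The function name match_positions_fixed_length is made up for illustration.

```python
def match_positions_fixed_length(list_a, list_b):
    """Return, for each item of list_a, the first index in list_b whose
    entry contains it, or None. Assumes all items share one length."""
    if not list_a:
        return []
    lengths = {len(key) for key in list_a}
    if len(lengths) != 1:
        raise ValueError("this sketch assumes all list_a items have the same length")
    k = lengths.pop()
    wanted = set(list_a)   # constant-time membership test on average
    first_seen = {}        # list_a item -> first list_b index containing it
    for j, text in enumerate(list_b):
        for start in range(len(text) - k + 1):
            piece = text[start:start + k]
            if piece in wanted and piece not in first_seen:
                first_seen[piece] = j
    return [first_seen.get(key) for key in list_a]

listA = ['123', '345', '678']
listB = ['ABC123', 'CDE455', 'GHK678', 'CGH345']
print(match_positions_fixed_length(listA, listB))
```

The work is proportional to the total number of characters in listB (times k for slicing), independent of how many items listA holds.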
