
Performance - Fastest way to compare 2 large lists of strings in Python

I have two Python lists: one contains about 13,000 disallowed phrases, and the other contains about 10,000 sentences.

phrases = [
    "phrase1",
    "phrase2",
    "phrase with spaces",
    # ...
]

sentences = [
    "sentence",
    "some sentences are longer",
    "some sentences can be really really ... really long, about 1000 characters.",
    # ...
]

I need to check every sentence in the sentences list to see if it contains any phrase from the phrases list; if it does, I want to put ** around the phrase and add the sentence to another list. I also need to do this as fast as possible.

This is what I have so far:

import re

newlist = []  # collects the rewritten sentences
for sentence in sentences:
    for phrase in phrases:
        if phrase in sentence.lower():
            # wrap the matched phrase in ** (case-insensitive substitution)
            iphrase = re.compile(re.escape(phrase), re.IGNORECASE)
            newsentence = iphrase.sub("**" + phrase + "**", sentence)
            newlist.append(newsentence)

So far this approach takes about 60 seconds to complete.

I tried using multiprocessing (each sentence's for loop was mapped separately); however, this yielded even slower results. Given that each process was running at about 6% CPU usage, it appears the overhead makes mapping such a small task to multiple cores not worthwhile. I thought about splitting the sentences list into smaller chunks and mapping those to separate processes, but haven't quite figured out how to implement this (something like the sketch below).
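For reference, a minimal sketch of that chunking idea, assuming phrases and sentences are defined at module level so worker processes can see them; the worker count and chunk size are illustrative guesses, not tuned values:

import re
from multiprocessing import Pool

def process_chunk(chunk):
    # Same inner loop as above, applied to one slice of the sentences list
    out = []
    for sentence in chunk:
        for phrase in phrases:
            if phrase in sentence.lower():
                iphrase = re.compile(re.escape(phrase), re.IGNORECASE)
                out.append(iphrase.sub("**" + phrase + "**", sentence))
    return out

if __name__ == '__main__':
    workers = 4  # illustrative; tune to your core count
    size = len(sentences) // workers + 1
    chunks = [sentences[i:i + size] for i in range(0, len(sentences), size)]
    with Pool(workers) as pool:
        newlist = [s for part in pool.map(process_chunk, chunks) for s in part]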

I've also considered using a binary search algorithm but haven't been able to figure out how to use this with strings.

So essentially, what would be the fastest possible way to perform this check?

Build your regex once, sorting the phrases longest-first so the ** markers wrap the longest matching phrase rather than the shortest, then perform the substitution and filter out the sentences where no substitution was made, eg:

phrases = [
    "phrase1",
    "phrase2",
    "phrase with spaces",
    "can be really really",
    "characters",
    "some sentences",
    # ...
]

sentences = [
    "sentence",
    "some sentences are longer",
    "some sentences can be really really ... really long, about 1000 characters.",
    # ...
]

import re

# Build the regex once, longest phrases first so they take precedence
rx = '({})'.format('|'.join(re.escape(el) for el in sorted(phrases, key=len, reverse=True)))
# Generator yielding each sentence with the substitutions applied
it = (re.sub(rx, r'**\1**', sentence) for sentence in sentences)
# Pair old and new sentences, keeping only those where a substitution was made
results = [new_sentence for old_sentence, new_sentence in zip(sentences, it) if old_sentence != new_sentence]

Gives you a result of:

['**some sentences** are longer',
 '**some sentences** **can be really really** ... really long, about 1000 **characters**.']
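Note that the question asks for case-insensitive matching, which the pattern above does not do. A sketch of one variation, compiling the pattern once with re.IGNORECASE (the \1 backreference preserves the sentence's original casing):

import re

rx = re.compile('({})'.format('|'.join(
    re.escape(el) for el in sorted(phrases, key=len, reverse=True))), re.IGNORECASE)
it = (rx.sub(r'**\1**', sentence) for sentence in sentences)
results = [new for old, new in zip(sentences, it) if old != new]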

What about a set comprehension?

found = {'**' + p + '**' for s in sentences for p in phrases if p in s}
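Note that this collects the matched phrases themselves rather than the rewritten sentences. With the sample data shown earlier it yields:

found == {'**some sentences**', '**can be really really**', '**characters**'}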

You could try updating (reducing) the phrases list, if you don't mind altering it:

found = []
p = phrases[:]  # shallow copy we can shrink as phrases are found
for s in sentences:
    for phrase in phrases:
        if phrase in s:
            p.remove(phrase)
            found.append('**' + phrase + '**')
    phrases = p[:]  # keep only the phrases not yet seen

Basically, each iteration reduces the phrases container. We iterate through the latest container, looking for phrases that appear in at least one sentence.

When a phrase is found, we remove it from the copied list; once all the current phrases have been checked, we replace the container with the reduced subset (the phrases that haven't been seen yet). We do this because we only need to see each phrase once, so checking it again (even though it may occur in another sentence) is unnecessary.
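For instance, with the sample phrases and sentences shown earlier, this produces:

found == ['**some sentences**', '**can be really really**', '**characters**']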
