Regex taking too long during loop

Question

This is a simplified version of my code.

    for i in range(len(holdList)):
        foundTerm = re.findall(r"\b" + self._searchTerm +
            r"\b", holdList[i][5], flags=re.IGNORECASE)
        # count the occurrence
        storyLen = len(foundTerm)
        holdList[i] += (storyLen,)
        if foundTerm:
            # Stores each found word as a list of strings
            # etc
            holdList[i] += (self.sentences_to_quote(holdList[i][5]), )

Inside the loop (on the last line) I call a different method that looks through each sentence and returns the first sentence that contains the word. The holdList holds the tuples returned by a MySQL query.

def sentences_to_quote(self, chapter):
    """
    Separates the chapter into sentences.
    Returns the first sentence that contains the word, with the word highlighted.
    """

    # Separate the chapter into sentences
    searchSentences = sent_tokenize.tokenize(chapter, realign_boundaries=True)
    findIt = r"\b" + self._searchTerm + r"\b"
    for word in searchSentences:
        regex = (re.sub(findIt,  
            "**" + self._searchTerm.upper() + "**", 
            word, flags=re.IGNORECASE))
        if regex != word:
            return regex

What can I do to speed this up? Is there anything I can do? The program is going through 10MB of text. Through profiling I found these two areas to be the bottleneck. I hope I provided enough info to make it clear.

Answer 1

I'm not sure whether your self._searchTerm will consist of phrases or single words, but in general you will get much better results from using sets and dicts rather than regex. You don't need the regex machinery in this case, since all you want is to count/match complete words. To search for a certain word in a sentence, for example, you can replace the regex with a set membership test:

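for sentence in sent_tokenize.tokenize(...):
    # the tokenizer yields sentences, so split each one into words
    search_sentence = set(sentence.split())
    if self._search_term in search_sentence:
        pass  # yay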

(I made your code PEP8 compliant.)

If you're worried about capitalization then convert everything to lower case:

self._search_term = self._search_term.lower()
for sentence in sent_tokenize.tokenize(...):
    # lowercase every word so the comparison ignores case
    search_sentence = set(word.lower() for word in sentence.split())
    if self._search_term in search_sentence:
        pass  # yay
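As a rough, self-contained sketch of the set-based check: the helper name `sentence_has_word` and the sample sentences are made up for illustration, and stripping punctuation with `string.punctuation` is an added assumption (plain `str.split` would leave "slept." and "slept" as different words). It only handles single-word terms, not phrases.

```python
import string


def sentence_has_word(sentence, term):
    """Return True if `term` occurs as a whole word in `sentence`, ignoring case."""
    # Split on whitespace, strip surrounding punctuation, lowercase each word.
    words = {w.strip(string.punctuation).lower() for w in sentence.split()}
    return term.lower() in words


# Hypothetical sample data standing in for the tokenized chapter.
sentences = ["The Dragon slept.", "Nobody saw the dragonfly."]
hits = [s for s in sentences if sentence_has_word(s, "dragon")]
print(hits)
```

Only the first sentence is reported, because "dragonfly" is a different word from "dragon"; this is the same whole-word behaviour the `\b` anchors gave in the regex version.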

You can also count occurrences of words using a collections.Counter or collections.defaultdict(int).
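A minimal sketch of the counting idea with `collections.Counter`. The function name `count_words` and the sample text are hypothetical, and the whitespace split plus punctuation stripping is a simplifying assumption about what counts as a word.

```python
import string
from collections import Counter


def count_words(text):
    """Count case-insensitive whole-word occurrences in `text`."""
    # Normalize each whitespace-separated token before counting.
    words = (w.strip(string.punctuation).lower() for w in text.split())
    # Skip tokens that were pure punctuation and are now empty.
    return Counter(w for w in words if w)


counts = count_words("The cat sat. The other cat ran!")
print(counts["cat"])
print(counts["dog"])  # a Counter returns 0 for missing keys
```

Building the counter once per chapter means every later lookup is a dictionary access, which may pay off if the same text is searched for several terms.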

If you must use regex because you want to match words that follow a specific pattern rather than matching entire words, then I suggest you compile the pattern once and pass the compiled pattern to the other methods, e.g.:

self.search_pattern = re.compile(r"\b{term}\b".format(term=self._search_term), re.I)
# compiled patterns expose findall(), not find_all()
found_term = self.search_pattern.findall(hold_list[i][5])
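Compiling once and reusing the pattern could look roughly like this. The class `TermSearcher` and the sample chapters are invented for the example, and wrapping the term in `re.escape` is an added assumption that the term is literal text; drop it if the term is meant to be a pattern itself.

```python
import re


class TermSearcher:
    def __init__(self, search_term):
        # Compile once; every later call reuses the same pattern object.
        # re.escape (an assumption) treats the term as literal text.
        self.search_pattern = re.compile(
            r"\b{term}\b".format(term=re.escape(search_term)), re.IGNORECASE
        )

    def count(self, text):
        """Number of whole-word, case-insensitive matches in `text`."""
        return len(self.search_pattern.findall(text))


searcher = TermSearcher("cat")
# Hypothetical stand-ins for the chapter column of each row.
chapters = ["The cat sat. A Cat ran.", "Concatenate nothing."]
totals = [searcher.count(chapter) for chapter in chapters]
print(totals)
```

Note that the `re` module already caches recently compiled patterns, so the gain from compiling explicitly may be modest; the main benefit is that the pattern string is built in one place.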

Answer 2

re.sub replaces the parts of a string that match the regex. Your task here is only to find out whether a match exists, so using re.search instead should give you a performance boost; re.search stops at the first match.
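One way to apply this while still returning the highlighted sentence is to test each sentence with `search` and run the substitution only on the first sentence that matches. This is a sketch under assumptions: `first_quote` is a hypothetical stand-in for `sentences_to_quote`, it takes an already tokenized list of sentences, and `re.escape` assumes the term is literal text.

```python
import re


def first_quote(sentences, term):
    """Return the first sentence containing `term`, with the term highlighted."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    highlight = "**" + term.upper() + "**"
    for sentence in sentences:
        # search() is a cheap existence check; no new string is built here.
        if pattern.search(sentence):
            # Substitute only once, on the sentence that actually matched.
            return pattern.sub(highlight, sentence)
    return None  # no sentence contains the term


quote = first_quote(["No match here.", "The cat saw another Cat."], "cat")
print(quote)
```

How much this helps depends on the data; it avoids building a replacement string for every non-matching sentence, but it has not been measured against the original code.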

2 answers

solution1
2 ACCEPTED 2014-05-26 09:59:09

solution2
1 2014-05-26 09:04:54
