Finding and substituting a list of words in a file using regex in Python

Question

I want to print the contents of a file to the terminal and in the process highlight any words that are found in a list without modifying the original file. Here's an example of the not-yet-working code:

    def highlight_story(self):
        """Print a line from a file and highlight words in a list."""

        the_file = open(self.filename, 'r')
        file_contents = the_file.read()

        for word in highlight_terms:
            regex = re.compile(
                  r'\b'      # Word boundary.
                + word       # Each item in the list.
                + r's{0,1}', # One optional 's' at the end.
                flags=re.IGNORECASE | re.VERBOSE)
            subst = '\033[1;41m' + r'\g<0>' + '\033[0m'
            result = re.sub(regex, subst, file_contents)

        print result
        the_file.close()

highlight_terms = [
    'dog',
    'hedgehog',
    'grue'
]

As it is, only the last item in the list, regardless of what it is or how long the list is, will be highlighted. I assume that each substitution is performed and then "forgotten" when the next iteration begins. It looks something like this (highlighted words are shown in [brackets]):

[Grues] have been known to eat both human and non-human animals. In poorly-lit areas dogs and hedgehogs are considered by any affluent [grue] to be delicacies. Dogs can frighten away a [grue], however, by barking in a musical scale. A hedgehog, on the other hand, must simply resign itself to its fate of becoming a hotdog fit for a [grue] king.

But it should look like this (highlighted words are shown in [brackets]):

[Grues] have been known to eat both human and non-human animals. In poorly-lit areas [dogs] and [hedgehogs] are considered by any affluent [grue] to be delicacies. [Dogs] can frighten away a [grue], however, by barking in a musical scale. A [hedgehog], on the other hand, must simply resign itself to its fate of becoming a hotdog fit for a [grue] king.

How can I stop the other substitutions from being lost?

Answer 1

You can combine all the terms into a single regex:

    regex = re.compile(r'\b(' + '|'.join(highlight_terms) + r')s?', flags=re.IGNORECASE | re.VERBOSE)

Note the ? instead of {0,1}; it has the same effect. With this pattern you no longer need the for loop: a single re.sub call highlights every term.

This code takes the list of words and joins them with |. So if your list was something like:

    a = ['cat', 'dog', 'mouse']

the regex would be:

    \b(cat|dog|mouse)s?
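
As a minimal sketch of the single-pattern approach, assuming the terms are plain words: the helper names build_highlight_regex and highlight below are hypothetical, not from the question. The sketch also passes each term through re.escape and uses a non-capturing group; neither is in the answer above, and the escaping only matters if a term contains regex metacharacters.

```python
import re

highlight_terms = ['dog', 'hedgehog', 'grue']


def build_highlight_regex(terms):
    """Compile one pattern matching any term, with an optional trailing 's'."""
    # re.escape is an addition: it keeps characters such as '.' literal.
    alternation = '|'.join(re.escape(term) for term in terms)
    return re.compile(r'\b(?:' + alternation + r')s?', flags=re.IGNORECASE)


def highlight(text, terms):
    """Wrap every match in ANSI escape codes (bold on a red background)."""
    regex = build_highlight_regex(terms)
    return regex.sub('\033[1;41m' + r'\g<0>' + '\033[0m', text)


sample = 'Dogs can frighten away a grue. A hedgehog becomes a hotdog.'
highlighted = highlight(sample, highlight_terms)
print(highlighted)
```

Because all terms are matched in one pass over the original text, no substitution can be lost, and "hotdog" stays unhighlighted because there is no word boundary before its "dog".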

Answer 2

Your regex is correct; the problem is in the for loop.

    result = re.sub(regex, subst, file_contents)

This line replaces every match of regex in file_contents with subst and stores the outcome in result.

On the second iteration it again runs the substitution on file_contents, whereas you intended to run it on result. Each pass therefore starts from the original text and discards the highlights from the previous pass.

How to correct

    result = file_contents

    for word in highlight_terms:
        regex = re.compile(
              r'\b'       # Word boundary.
            + word        # Each item in the list.
            + r's?\b',    # One optional 's', then a closing word boundary.
            flags=re.IGNORECASE | re.VERBOSE)
        print(regex.pattern)  # Debug output: show the pattern being applied.
        subst = '\033[1;41m' + r'\g<0>' + '\033[0m'
        result = re.sub(regex, subst, result)  # Change made here: feed result back in.

    print(result)
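
To make the difference concrete, here is a small sketch that runs both loops on the same input. The function names highlight_lossy and highlight_chained are hypothetical, and the sketch marks matches with [ ] instead of ANSI codes so the results are easy to compare; re.escape is also an addition.

```python
import re


def _term_regex(word):
    """Pattern for one term: word boundary, term, optional 's', word boundary."""
    return re.compile(r'\b' + re.escape(word) + r's?\b', flags=re.IGNORECASE)


def highlight_lossy(text, terms):
    """The question's behaviour: every pass starts from the untouched text."""
    result = text
    for word in terms:
        result = _term_regex(word).sub(r'[\g<0>]', text)  # input is always `text`
    return result


def highlight_chained(text, terms):
    """The corrected loop: every pass works on the previous pass's output."""
    result = text
    for word in terms:
        result = _term_regex(word).sub(r'[\g<0>]', result)  # input is `result`
    return result


text = 'Dogs fear a grue.'
terms = ['dog', 'grue']
lossy = highlight_lossy(text, terms)
chained = highlight_chained(text, terms)
print(lossy)
print(chained)
```

The first function keeps only the marks for the last term, which is the symptom described in the question; the second keeps the marks for every term.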

Answer 3

You need to reassign file_contents to the substituted string on each pass through the loop. Reassigning the variable only changes the string in memory; it does not change the contents of the file:

    def highlight_story(self):
        """Print a line from a file and highlight words in a list."""

        the_file = open(self.filename, 'r')
        file_contents = the_file.read()
        for word in highlight_terms:
            regex = re.compile(
                  r'\b'      # Word boundary.
                + word       # Each item in the list.
                + r's{0,1}', # One optional 's' at the end.
                flags=re.IGNORECASE | re.VERBOSE)
            subst = '\033[1;41m' + r'\g<0>' + '\033[0m'
            file_contents = re.sub(regex, subst, file_contents)  # Reassign to the updated value.
        print(file_contents)
        the_file.close()

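Using with to open the file is also a better way to go, because the file is closed automatically. If you want to keep the original string untouched, start a second variable from it outside the loop and update that one inside: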

    def highlight_story(self):
        """Print a line from a file and highlight words in a list."""
        with open(self.filename) as the_file:
            file_contents = the_file.read()  # The file is closed when this block ends.
        output = file_contents  # Start from the original text.
        for word in highlight_terms:
            regex = re.compile(
                  r'\b'      # Word boundary.
                + word       # Each item in the list.
                + r's{0,1}', # One optional 's' at the end.
                flags=re.IGNORECASE | re.VERBOSE)
            subst = '\033[1;41m' + r'\g<0>' + '\033[0m'
            output = re.sub(regex, subst, output)  # Update the working copy.
        print(output)
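
For a version that can be run outside the class, here is a rough sketch that separates reading the file from highlighting the text. The names highlight_text, highlight_file, HIGHLIGHT_ON and HIGHLIGHT_OFF are hypothetical, the temporary file exists only for the demonstration, and re.escape is an addition to the pattern used above.

```python
import os
import re
import tempfile

HIGHLIGHT_ON = '\033[1;41m'   # ANSI: bold text on a red background.
HIGHLIGHT_OFF = '\033[0m'     # ANSI: reset all attributes.


def highlight_text(text, terms):
    """Return text with every term (plus an optional 's') wrapped in ANSI codes."""
    output = text
    for word in terms:
        regex = re.compile(r'\b' + re.escape(word) + r's?', flags=re.IGNORECASE)
        output = regex.sub(HIGHLIGHT_ON + r'\g<0>' + HIGHLIGHT_OFF, output)
    return output


def highlight_file(path, terms):
    """Read a file and return its highlighted contents; the file is only read."""
    with open(path) as the_file:
        file_contents = the_file.read()
    return highlight_text(file_contents, terms)


# Demonstration with a throwaway file.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('A hedgehog met two dogs.\n')
    demo_path = tmp.name

demo_output = highlight_file(demo_path, ['dog', 'hedgehog', 'grue'])
print(demo_output)

# The file on disk still holds the original, unhighlighted text.
with open(demo_path) as check:
    unchanged = check.read()
os.remove(demo_path)
```

Keeping the substitution in its own function means it can be tested on plain strings, and the file is opened for reading only, so the original is never modified.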
