
Python: Why is re.sub not replacing dict key with dict value when there is ',' in dict key

I'm somewhat of a Python/programming newbie. First up, the code:

import re
patt_list = ['However,', 'phenomenal', 'brag']
dictionary = {'phenomenal': 'phenomenal|extraordinary|remarkable|incredible', 'However,': 'However,|Nevertheless,|Nonetheless,', 'brag': 'brag|boast'}

def replacer_factory1(dictionary):
    def replacing(match):
        if len(dictionary) > 0:
            word = match.group()
            exchange = dictionary.get(word, word)
            spintax = '{' + exchange + '}'
            create_place_holder = spintax.replace(' ', '#!#')
            return create_place_holder
        else:
            return ""
    return replacing

def replacing1(text):
    regex_patt_list = r'\b(?:' + '|'.join(patt_list) + r')\b'
    replacer = replacer_factory1(dictionary)
    return re.sub(regex_patt_list, replacer, text)

with open('test_sent.txt', 'r+') as sent:
    read_sent = sent.read()
    sent.seek(0)
    sent.write(replacing1(read_sent))

The code searches the text file test_sent.txt for the words in the list called patt_list. Wherever one of those words occurs, re.sub replaces it with the value stored under that key in the dictionary called dictionary, and the result is written back to the text file. (This code is part of a bigger script in which the keys of the dictionary are created from patt_list, in case you were wondering why patt_list is needed here at all.)

However, the problem that I have with this code is that the dictionary key However, is not replaced with its corresponding value However,|Nevertheless,|Nonetheless, - whereas the rest of the key:value replacements work just fine, and are written to the text file.

I believe it may be the comma in However, that is causing this problem because I tried another key:value with a comma at the end of the key and this did not work either.

Can anyone enlighten me as to why this is happening?

Contents of 'test_sent.txt' before running code:

Quite phenomenal. However, nothing to brag about?

Contents of 'test_sent.txt' after running code:

Quite {phenomenal|extraordinary|remarkable|incredible}. However, nothing to {brag|boast} about?

What I actually want the output to look like:

Quite {phenomenal|extraordinary|remarkable|incredible}. {However,|Nevertheless,|Nonetheless,} nothing to {brag|boast} about bragg's vinegar?

What I don't want (a partial match on bragg's ):

Quite {phenomenal|extraordinary|remarkable|incredible}. {However,|Nevertheless,|Nonetheless,} nothing to {brag|boast} about {brag|boast}g's vinegar?

EDIT: In response to the helpful answer by 'WKPLUS' below, removing the \b from the end of regex_patt_list works here, but not for the larger use I have for this code. The dictionary is much bigger in reality, so when the \b is removed, I get partial matches in the text, which I don't want. I updated test_sent.txt to add the words bragg's vinegar at the end to illustrate the partial match issue when removing the \b.
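The trade-off described in this edit can be reproduced with a short sketch. It reuses patt_list from the question; the sample string and the two pattern variable names are assumptions made up for the illustration.

```python
import re

patt_list = ['However,', 'phenomenal', 'brag']
# Hypothetical sample text containing both "brag" and "bragg's".
text = "However, nothing to brag about bragg's vinegar?"

# The pattern from the question, with \b on both sides.
with_trailing_b = r'\b(?:' + '|'.join(patt_list) + r')\b'
# The same pattern with the trailing \b removed.
without_trailing_b = r'\b(?:' + '|'.join(patt_list) + r')'

# With the trailing \b, 'However,' is missed but "bragg's" is left alone.
print(re.findall(with_trailing_b, text))     # ['brag']
# Without it, 'However,' is found, but so is the "brag" inside "bragg's".
print(re.findall(without_trailing_b, text))  # ['However,', 'brag', 'brag']
```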

Removing the second \b from regex_patt_list will solve your problem.

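# Uses re, patt_list and dictionary as defined in the question.
def replacer_factory1(dictionary):
    def replacing(match):
        if len(dictionary) > 0:
            # The match ends with one extra non-word character (see \W below);
            # strip it to recover the dictionary key.
            word = match.group()[:-1]
            exchange = dictionary.get(word, word)
            spintax = '{' + exchange + '}'
            create_place_holder = spintax.replace(' ', '#!#')
            # Put the consumed trailing character back after the replacement.
            return create_place_holder + match.group()[-1]
        else:
            return ""
    return replacing

def replacing1(text):
    # \W instead of the final \b: require one non-word character after the key.
    regex_patt_list = r'\b(?:' + '|'.join(patt_list) + r')\W'
    replacer = replacer_factory1(dictionary)
    return re.sub(regex_patt_list, replacer, text)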

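This is a somewhat tricky workaround for your problem: the pattern consumes one trailing non-word character, and the replacer puts it back after the substitution.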

I think I see the issue. The comma is not considered a "word character". So, in the string 'However,' the word boundary falls between the "r" and the comma, not after the comma. The comma and the white space that follows it are both non-word characters, so there is no boundary between them, and the trailing \b in your pattern cannot match at that position.
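This behaviour can be checked directly with a minimal sketch that uses only the standard re module. The sample sentence is the one from the question; the variable names are chosen just for this illustration.

```python
import re

text = "Quite phenomenal. However, nothing to brag about?"

# \b needs a word character on exactly one side of the position.
# After "However," come a comma and a space, both non-word characters,
# so there is no boundary there and the search fails.
no_match = re.search(r'\bHowever,\b', text)
print(no_match)  # None

# Without the comma in the pattern, the boundary sits between "r" and ",".
bare = re.search(r'\bHowever\b', text)
print(bare.group())  # However

# The comma version only matches when a word character follows the comma.
glued = re.search(r'\bHowever,\b', "However,x")
print(glued.group())  # However,
```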

Would it work out the way you want if you were to replace that final \b with \W (which matches a non-word character)?
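One thing to keep in mind is that \W consumes the character after the key, and it needs a character to be there, so a key at the very end of the text would not match. As a possible variant, not taken from either answer, the sketch below swaps in the lookarounds (?<!\w) and (?!\w), which only assert that no word character is adjacent and consume nothing. It also passes each key through re.escape in case a key contains regex metacharacters. The names pattern, to_spintax and result are hypothetical, and the text is passed in as a string rather than read from test_sent.txt.

```python
import re

patt_list = ['However,', 'phenomenal', 'brag']
dictionary = {
    'phenomenal': 'phenomenal|extraordinary|remarkable|incredible',
    'However,': 'However,|Nevertheless,|Nonetheless,',
    'brag': 'brag|boast',
}

# (?<!\w) : no word character immediately before the key
# (?!\w)  : no word character immediately after the key
# Neither assertion consumes text, so the match is exactly the key.
pattern = re.compile(
    r'(?<!\w)(?:' + '|'.join(re.escape(p) for p in patt_list) + r')(?!\w)'
)

def to_spintax(match):
    word = match.group()
    spintax = '{' + dictionary.get(word, word) + '}'
    # Placeholder step kept from the question's code.
    return spintax.replace(' ', '#!#')

text = "Quite phenomenal. However, nothing to brag about bragg's vinegar?"
result = pattern.sub(to_spintax, text)
print(result)
# Quite {phenomenal|extraordinary|remarkable|incredible}. {However,|Nevertheless,|Nonetheless,} nothing to {brag|boast} about bragg's vinegar?
```

Because the match is exactly the dictionary key, the replacement function does not need to strip and re-append a trailing character.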
