
String replacement with dictionary, complications with punctuation

I'm trying to write a function process(s, d) that replaces abbreviations in a string with their full meaning by using a dictionary, where s is the input string and d is the dictionary. For example:

>>> d = {'ASAP': 'as soon as possible'}
>>> s = "I will do this ASAP.  Regards, X"
>>> process(s, d)
'I will do this as soon as possible.  Regards, X'

I have tried using the split function to separate the string and compare each part with the dictionary.

def process(s):
    return ''.join(d[ch] if ch in d else ch for ch in s)

However, it returns exactly the same string. I suspect the code doesn't work because of the full stop after ASAP in the original string. If so, how do I ignore the punctuation and get ASAP replaced?

Here is a way to do it with a single regex:

In [23]: import re

In [24]: d = {'ASAP':'as soon as possible', 'AFAIK': 'as far as I know'}

In [25]: s = 'I will do this ASAP, AFAIK.  Regards, X'

In [26]: re.sub(r'\b(?:' + '|'.join(map(re.escape, d)) + r')\b', lambda m: d[m.group(0)], s)
Out[26]: 'I will do this as soon as possible, as far as I know.  Regards, X'

Unlike versions based on str.replace(), this observes word boundaries and therefore won't replace abbreviations that happen to appear in the middle of other words (e.g. "etc" in "fetch").

Also, unlike most (all?) other solutions presented thus far, it iterates over the input string just once, regardless of how many search terms there are in the dictionary.
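For reference, here is a minimal sketch of the single-regex idea wrapped up as the process(s, d) function the question asks for. The non-capturing group, the re.escape call and the longest-first sort are my own additions rather than part of the answer: the group makes \b apply to every alternative, the escaping guards against keys containing regex metacharacters, and the sort keeps a short key from shadowing a longer one that starts with it. The 'etc' entry is only there to illustrate the word-boundary behaviour.

```python
import re

def process(s, d):
    """Replace whole-word occurrences of the keys of d in s with their values."""
    if not d:
        return s
    # Longest keys first, so a key that is a prefix of another cannot win the alternation.
    keys = sorted(d, key=len, reverse=True)
    # (?:...) groups the alternatives so that \b applies to each of them.
    pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, keys)) + r')\b')
    # The callable looks up whichever key matched; the string is scanned once.
    return pattern.sub(lambda m: d[m.group(0)], s)

d = {'ASAP': 'as soon as possible', 'AFAIK': 'as far as I know', 'etc': 'et cetera'}
print(process('I will do this ASAP, AFAIK.  Regards, X', d))
print(process('fetch the files, etc.', d))
```

The second call leaves "fetch" alone and only expands the standalone "etc".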

You can do something like this:

def process(s,d):
    for key in d:
        s = s.replace(key,d[key])
    return s

Here is a working solution: use re.split(), and split by word boundaries (preserving the interstitial characters):

''.join( d.get( word, word ) for word in re.split( r'(\W+)', s ) )

One significant difference between this code and Vaughn's or Sheena's answer is that this code takes advantage of the O(1) lookup time of the dictionary, while their solutions look at every key in the dictionary. This means that when s is short and d is very large, their code will take significantly longer to run. Furthermore, parts of words will still be replaced in their solutions: if d = { "lol": "laugh out loud" } and s = "lollipop", their solutions will incorrectly produce "laugh out loudlipop".
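To make the comparison concrete, here is a small sketch that puts the two approaches side by side. The function names process_split and process_replace are made up for this illustration, and the sample sentence is my own.

```python
import re

def process_split(s, d):
    # The capturing group makes re.split keep the separators in its result,
    # so joining the pieces reproduces the original spacing and punctuation.
    return ''.join(d.get(piece, piece) for piece in re.split(r'(\W+)', s))

def process_replace(s, d):
    # Plain substring replacement: visits every key and ignores word boundaries.
    for key in d:
        s = s.replace(key, d[key])
    return s

d = {'lol': 'laugh out loud'}
print(process_split('lol, a lollipop', d))    # only the standalone word is expanded
print(process_replace('lol, a lollipop', d))  # "lollipop" is mangled as well
```

One caveat with the split-based version, as far as I can tell: it only works for keys made up entirely of word characters, because a key such as "e.g." is cut into several pieces before the lookup happens.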

Use regular expressions:

re.sub(pattern,replacement,s)

In your application:

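import re

def process(s, d):
    ret = s
    for key in d:
        # \b pins the match to word boundaries; re.escape protects keys containing regex metacharacters
        ret = re.sub(r'\b' + re.escape(key) + r'\b', d[key], ret)
    return ret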

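\b matches the boundary at the beginning or end of a word. In Python, write it as r'\b' or '\\b', because a plain '\b' is a backspace character. Thanks to Paul for the comment.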

Instead of splitting by spaces, use:

re.split(r"\W", s)

It will split on any character that cannot be part of a word. (str.split() does not accept regular expressions, so this needs the re module.)
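A quick sketch of what that split gives you, using the string from the question. Splitting on \W isolates the words but throws the separators away, so the original sentence cannot be rebuilt from the result. Wrapping the pattern in a capturing group (my addition, not part of this answer) keeps them.

```python
import re

s = 'I will do this ASAP.  Regards, X'

# Splitting on every single non-word character: the punctuation and spaces
# are discarded, and adjacent separators leave empty strings behind.
words = re.split(r'\W', s)
print(words)

# With a capturing group the separators are returned as list items too,
# so the pieces can be joined back into the original string.
pieces = re.split(r'(\W+)', s)
print(pieces)
print(''.join(pieces) == s)
```

Either way, 'ASAP' now appears as its own item and can be looked up in the dictionary.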

This is string replacement as well (+1 to @VaughnCato). It uses the reduce function to iterate through your dictionary, replacing any instances of the keys in the string with the values. s in this case is the initial value of the accumulator, which is fed to the replace function on every iteration, so all past replacements are kept. Also, per @PaulMcGuire's point above, this replaces keys starting with the longest and ending with the shortest. In Python 3, reduce has to be imported from functools.

In [1]: from functools import reduce

In [2]: d = {'ASAP':'as soon as possible', 'AFAIK': 'as far as I know'}

In [3]: s = 'I will do this ASAP, AFAIK.  Regards, X'

In [4]: reduce(lambda x, y: x.replace(y, d[y]), sorted(d, key=len, reverse=True), s)
Out[4]: 'I will do this as soon as possible, as far as I know.  Regards, X'

As for why your function didn't return what you expected: when you iterate through s, you are actually iterating through the characters of the string, not the words. Your version could be tweaked by iterating over s.split() (which would be a list of the words), but you then run into an issue where the punctuation causes words not to match your dictionary. You can get them to match by importing string and stripping string.punctuation from each word, but that will remove the punctuation from the final string (so regex would likely be the best option if replacement doesn't work).
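The three cases described above can be checked with a short sketch. The variable names and the strip_punct helper are just for illustration, and joining with a single space is my own simplification (it also collapses the double space in the input).

```python
import string

d = {'ASAP': 'as soon as possible'}
s = 'I will do this ASAP.  Regards, X'

# Iterating over a string yields single characters, none of which is a key
# in d, so the result is identical to the input.
by_char = ''.join(d[ch] if ch in d else ch for ch in s)
print(by_char)

# s.split() yields words, but 'ASAP.' still carries its full stop and
# therefore does not match the key 'ASAP'.
by_word = ' '.join(d.get(w, w) for w in s.split())
print(by_word)

def strip_punct(word):
    # Removes leading and trailing punctuation characters only.
    return word.strip(string.punctuation)

# Stripping punctuation makes the lookup succeed, but the punctuation is
# then missing from the output.
stripped = ' '.join(d.get(strip_punct(w), strip_punct(w)) for w in s.split())
print(stripped)
```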

    Python 3.2

    for i, v in d.items():
        s = s.replace(i, v)  # rebind s so the replacements accumulate


 