Python regex extra character on re.sub (Regex replacement)

I'm trying to create a simple function to replace only accented characters with normal ones:

import re

def remove_accents(r):
    r = re.sub("[àáâãäå]", 'a', r)
    r = re.sub("[èéêë]", 'e', r)
    r = re.sub("[ìíîï]", 'i', r)
    r = re.sub("[òóôõö]", 'o', r)
    r = re.sub("[ùúûü]", 'u', r)
    r = re.sub("[ýÿ]", 'y', r)

    return r

The problem is that when I replace an accented character with its plain equivalent, Python adds an extra character, and I don't know why.

Example

import re

my_string = "Joaquín Noriega"
print re.sub(r"[ìíîï]", r'i', my_string)

This is the output I get:

Output: 'Joaquiin Noriega'

Note the double 'ii' in the name; it should be 'Joaquin Noriega'.

  • Why is this happening? Is there something wrong with my regex?

In Python 2, plain string literals are byte strings, so when the source is encoded as UTF-8, the regex really looks like this:

'[\xc3\xac\xc3\xad\xc3\xae\xc3\xaf]'

The í in "Joaquín Noriega" is encoded the same way, as the two bytes \xc3\xad. The character class matches one byte at a time, both of those bytes are in the class, and so each one is replaced with the single-byte i, which produces "ii".
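The byte-level behaviour described above can be reproduced in Python 3 by encoding the pattern and the input to UTF-8 by hand. This is only a sketch of what Python 2 was effectively doing; the variable names are illustrative and not part of the original question.

```python
import re

# Encode pattern and input as UTF-8 bytes, which is what Python 2 worked with.
pattern = "[ìíîï]".encode("utf-8")
name = "Joaquín Noriega".encode("utf-8")

# 'í' is two bytes, and the byte-level character class contains both of them.
accent_bytes = "í".encode("utf-8")
print(accent_bytes)  # b'\xc3\xad'

# Each of the two bytes matches separately and is replaced with its own b"i".
result = re.sub(pattern, b"i", name)
print(result)  # b'Joaquiin Noriega'
```

Because a bytes pattern treats every byte inside the brackets as a separate alternative, any multi-byte character from the class turns into as many replacements as it has bytes.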

The preferable fix is to switch to Python 3, where strings are Unicode text by default. If you can't, Unicode strings in Python 2 will do:

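# -*- coding: utf-8 -*-
import re

# The u prefix makes the input, the pattern and the replacement Unicode strings,
# so the character class matches whole characters rather than single bytes.
my_string = u"Joaquín Noriega"
print re.sub(u"[ìíîï]", u'i', my_string)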
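For comparison, here is a sketch of the function from the question written for Python 3, where str is Unicode text and no u prefix is needed. It assumes the input uses precomposed accented characters (a single code point per letter), which is what the literals in the question contain.

```python
import re

def remove_accents(text):
    # In Python 3 each class below matches whole characters, not bytes.
    text = re.sub("[àáâãäå]", "a", text)
    text = re.sub("[èéêë]", "e", text)
    text = re.sub("[ìíîï]", "i", text)
    text = re.sub("[òóôõö]", "o", text)
    text = re.sub("[ùúûü]", "u", text)
    text = re.sub("[ýÿ]", "y", text)
    return text

print(remove_accents("Joaquín Noriega"))  # Joaquin Noriega
```

The logic is unchanged from the question; only the string type differs, which is why the doubled "ii" no longer appears.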
