I'm trying to create a simple function to replace only accented characters with normal ones:
import re
def remove_accents(r):
    r = re.sub("[àáâãäå]", 'a', r)
    r = re.sub("[èéêë]", 'e', r)
    r = re.sub("[ìíîï]", 'i', r)
    r = re.sub("[òóôõö]", 'o', r)
    r = re.sub("[ùúûü]", 'u', r)
    r = re.sub("[ýÿ]", 'y', r)
    return r
The problem is that when I replace an accented character with the plain one, Python adds an extra character, and I don't know why.
Example
import re
my_string = "Joaquín Noriega"
print re.sub(r"[ìíîï]", r'i', my_string)
This is what I get on my output:
Output: 'Joaquiin Noriega'
Note the double 'ii' in the name; it should be 'Joaquin Noriega'.
In Python 2, plain string literals are byte strings, so when the source is encoded as UTF-8, the regex really looks like this:
'[\xc3\xac\xc3\xad\xc3\xae\xc3\xaf]'
The í in "Joaquín Noriega" is encoded the same way, as two bytes, and they both match the character class, so they're both replaced with the single-byte i.
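The byte-level behaviour described above can be reproduced in Python 3 by encoding both the pattern and the input to UTF-8 by hand. This is only an illustrative sketch; the variable names are made up for the example, and it assumes Python 3 so that the bytes are explicit:

```python
import re

# Encode the text and the character class to UTF-8, mimicking Python 2 byte strings.
name = "Joaquín Noriega".encode("utf-8")
pattern = "[ìíîï]".encode("utf-8")

# The class now lists individual bytes, not characters.
assert pattern == b"[\xc3\xac\xc3\xad\xc3\xae\xc3\xaf]"

# 'í' is two bytes, and each of them is a member of that class.
assert "í".encode("utf-8") == b"\xc3\xad"

# Each matching byte is replaced separately, hence the doubled 'i'.
broken = re.sub(pattern, b"i", name)
print(broken)  # b'Joaquiin Noriega'
```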
The preferred fix is to switch to Python 3, which has sane text handling; if you can't, Unicode strings will do:
import re
my_string = u"Joaquín Noriega"
print re.sub(u"[ìíîï]", u'i', my_string)
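For comparison, here is a rough sketch of how the function from the question might look under Python 3, where string literals are Unicode text and each accented letter is a single character in the class. The `ACCENT_CLASSES` table is a name introduced for this example; the character classes themselves are copied from the question:

```python
import re

# (character class, replacement) pairs taken from the question's function.
# ACCENT_CLASSES is a hypothetical name used only in this sketch.
ACCENT_CLASSES = [
    ("[àáâãäå]", "a"),
    ("[èéêë]", "e"),
    ("[ìíîï]", "i"),
    ("[òóôõö]", "o"),
    ("[ùúûü]", "u"),
    ("[ýÿ]", "y"),
]

def remove_accents(text):
    """Replace the lowercase accented letters listed above with plain ones."""
    for pattern, replacement in ACCENT_CLASSES:
        text = re.sub(pattern, replacement, text)
    return text

print(remove_accents("Joaquín Noriega"))  # Joaquin Noriega
```

Like the original, this only covers the lowercase letters listed in the classes; uppercase accented letters pass through unchanged.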