Python 2.7
I am processing a UTF-8 encoded file (Greek), and regex seems to have some issues.
Regex seems to work fine when I do not use a character class. When I do:
text = re.sub('αυ','kk',text,flags=re.UNICODE)
everything works fine; for instance, 'αυτιά' is converted to 'kkτιά'.
However, when I use a character class, like:
text = re.sub('αυ[τ]','kk',text,flags=re.UNICODE)
a garbage character is shown and 'αυτιά' is converted to 'kk ia'. Is it an encoding issue, or is something wrong with my regex? Excuse me, but I am pretty new to the regex mindset.
Thanks!
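Pass unicode objects instead of byte strings. In Python 2 a plain 'αυ[τ]' literal is a UTF-8 byte string, so [τ] becomes a class of the two bytes that encode τ and matches only one of them, leaving a stray byte behind: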
>>> print re.sub('αυ[τ]', 'kk', 'αυτιά', flags=re.UNICODE)
kk▒ιά
>>> print re.sub(u'αυ[τ]', u'kk', u'αυτιά', flags=re.UNICODE)
kkιά
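To see where the stray character comes from, here is a minimal sketch that compares the two calls at the byte level. The variable names (text, pattern, broken, fixed) are only illustrative, and the explicit .encode('utf-8') stands in for what a plain '...' literal gives you in a UTF-8 source file under Python 2. It assumes your file really is UTF-8.

```python
# -*- coding: utf-8 -*-
import re

text = u'αυτιά'
pattern = u'αυ[τ]'

# Byte-string versions: roughly what Python 2 sees for plain '...' literals
# (assumption: the source file / input data is UTF-8 encoded).
text_bytes = text.encode('utf-8')
pattern_bytes = pattern.encode('utf-8')

# τ is encoded as two bytes, so in a byte pattern [τ] is a class
# containing two separate single bytes, not one letter.
print(repr(u'τ'.encode('utf-8')))

# The class matches only the first byte of τ; the second byte is left
# behind in the result and shows up as a garbage character.
broken = re.sub(pattern_bytes, b'kk', text_bytes)
print(repr(broken))

# Decoding to unicode first makes [τ] a class with exactly one character.
fixed = re.sub(pattern, u'kk', text_bytes.decode('utf-8'), flags=re.UNICODE)
print(repr(fixed))
```

When the text comes from a file, one option (not the only one) is to open it with io.open(path, encoding='utf-8'), which returns unicode objects in Python 2.7, and to encode back to UTF-8 only when writing the result out.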