
Regex in Python prints garbage when using a character class

Python 2.7

I am processing a UTF-8 encoded file (Greek), and it seems that regex has some issues with it.

Regex seems to work fine when I do not use a character class. When I do:

        text = re.sub('αυ','kk',text,flags=re.UNICODE)

everything works fine; for instance, 'αυτιά' is converted to 'kkτιά'.

However, when I use a character class, like:

        text = re.sub('αυ[τ]','kk',text,flags=re.UNICODE)

a garbage character is shown and 'αυτιά' is converted to 'kk ia'. Is it an encoding issue, or is something wrong with my regex? Excuse me, but I am pretty new to the regex mindset.

Thanks!

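Pass unicode objects instead of byte strings. In Python 2, a plain 'τ' literal in UTF-8 is a two-byte str, so [τ] becomes a class of two separate bytes and matches only the first one; the second byte is left behind and shows up as garbage: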

>>> print re.sub('αυ[τ]', 'kk', 'αυτιά', flags=re.UNICODE)
kk▒ιά
>>> print re.sub(u'αυ[τ]', u'kk', u'αυτιά', flags=re.UNICODE)
kkιά
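
To see where the stray character comes from, here is a minimal sketch that runs the same substitution once on UTF-8 bytes and once on decoded text. The variable names (`text`, `raw`, `byte_pattern`, `broken`, `fixed`) are made up for illustration, and the `re.UNICODE` flag is left out on purpose: it does not change how a character class is parsed, and Python 3 rejects it on byte patterns.

```python
# -*- coding: utf-8 -*-
import re

text = u'αυτιά'                  # decoded (unicode) text
raw = text.encode('utf-8')       # the same word as UTF-8 bytes

# 'τ' takes two bytes in UTF-8 (0xCF 0x84). Inside a byte pattern,
# [τ] is therefore a class of two separate bytes, not one letter.
tau = u'τ'.encode('utf-8')
byte_pattern = u'αυ[τ]'.encode('utf-8')

# The class consumes only 0xCF; the orphaned 0x84 stays in the result.
broken = re.sub(byte_pattern, b'kk', raw)

# On decoded text the class holds a single character and matches all of 'τ'.
fixed = re.sub(u'αυ[τ]', u'kk', text)

print(len(tau))
print(repr(broken))
print(repr(fixed))
```

The leftover byte 0x84 is not valid UTF-8 on its own, which is why the terminal prints a placeholder glyph. When the input comes from a file, one option is to decode it while reading, for example with `io.open(path, encoding='utf-8')` (where `path` is a placeholder), so that `re.sub` only ever sees unicode objects.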
