Python regex prints garbage when using a character class

Question

Python 2.7

I am processing a UTF-8 encoded file (Greek text), and the regex seems to have some issues.

The regex works fine when I do not use a character class. When I do:

        text = re.sub('αυ','kk',text,flags=re.UNICODE)

everything works fine; for instance, 'αυτιά' is converted to 'kkτιά'.

However, when I use a character class like this:

        text = re.sub('αυ[τ]','kk',text,flags=re.UNICODE)

a garbage character is shown, and 'αυτιά' is converted to 'kk ia'. Is it an encoding issue, or is something wrong with my regex? Excuse me, but I am pretty new to the regex mindset.

Thanks!

Answer 1

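Pass unicode objects instead of byte strings. In Python 2, a plain 'αυ[τ]' literal is a UTF-8 byte string in which τ takes two bytes, so the class [τ] matches only one of those bytes and the other is left behind as garbage: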

>>> print re.sub('αυ[τ]', 'kk', 'αυτιά', flags=re.UNICODE)
kk▒ιά
>>> print re.sub(u'αυ[τ]', u'kk', u'αυτιά', flags=re.UNICODE)
kkιά
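
To make the difference visible, here is a minimal sketch that reproduces both behaviours side by side. The variable names (text, pattern, broken, fixed) are only illustrative, and the byte-level case is simulated by encoding to UTF-8 explicitly, so the snippet behaves the same way on Python 2.7 and Python 3:

```python
# -*- coding: utf-8 -*-
import re

text = u'αυτιά'
pattern = u'αυ[τ]'

# Byte-level matching, which is what Python 2 does with plain str literals.
# 'τ' is the two bytes 0xCF 0x84 in UTF-8, so the class [τ] matches either
# single byte; only 0xCF is consumed and 0x84 is left behind.
broken = re.sub(pattern.encode('utf-8'), b'kk', text.encode('utf-8'))

# Character-level matching on unicode objects: [τ] matches the whole letter.
fixed = re.sub(pattern, u'kk', text, flags=re.UNICODE)

print(repr(broken))  # byte string containing a stray \x84 after 'kk'
print(repr(fixed))   # unicode string with 'τ' removed cleanly
```

The stray 0x84 byte is not valid UTF-8 on its own, which is why the terminal shows a garbage character. When the text comes from a file, as in the question, opening it with io.open(path, encoding='utf-8') yields unicode objects in Python 2.7, so the second form applies directly.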

solution1
3 ACCEPTED 2015-01-11 00:36:59
