How can I use a regular expression on a Unicode string in Python?

Hi, I want to use a regular expression on a UTF-8 encoded Unicode string such as the following:

</td><td>عـــــــــــادي</td><td> 40.00</td>

I want to extract "عـــــــــــادي". How can I do this?

My current code is:

state = re.findall(r'td>...</td',s)

Thanks

I ran into something similar when trying to match a string in Russian. For your situation, Michele's answer works fine. If you want to use special sequences like \w and \s, though, you have to change a few things. I'm sharing this in the hope that it will be useful to someone else.

>>> string = u"</td><td>Я люблю мороженое</td><td> 40.00</td>"

Make the string Unicode by placing a u before the opening quotation mark.

>>> pattern = re.compile(ur'>([\w\s]+)<', re.UNICODE)

Set the re.UNICODE flag so that \w and \s match Unicode characters as well (see the docs).

(Alternatively, you can spell out a character range for your own language. For Russian this would be [а-яА-Я], so:

pattern = re.compile(ur'>([а-яА-Я\s]+)<')

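In that case the flag is no longer needed: the letters are listed explicitly, and \s still matches ordinary ASCII whitespace without it. Note that ё and Ё fall outside these two ranges and would have to be added to the class separately.)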

>>> match = pattern.findall(string)
>>> for i in match:
...     print i
... 
Я люблю мороженое
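
For anyone on Python 3, here is a minimal sketch of the same two approaches. It assumes Python 3, where the ur prefix no longer exists and str patterns are Unicode-aware by default; the variable names are only illustrative and are not from the session above.

```python
import re

text = "</td><td>Я люблю мороженое</td><td> 40.00</td>"

# In Python 3, str patterns are Unicode-aware by default, so \w and \s
# match Cyrillic letters and whitespace without passing re.UNICODE.
word_pattern = re.compile(r'>([\w\s]+)<')
word_matches = word_pattern.findall(text)

# Explicit-range variant. ё/Ё are added by hand because they sit outside
# the contiguous а-я and А-Я code point ranges.
range_pattern = re.compile(r'>([а-яА-ЯёЁ\s]+)<')
range_matches = range_pattern.findall(text)

for m in word_matches:
    print(m)  # prints: Я люблю мороженое
```

The " 40.00" cell is skipped by both patterns because the dot is neither a word character nor whitespace.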

According to PEP 263 (Defining Python Source Code Encodings), you first need to tell Python that the whole source file is UTF-8 encoded by adding a comment like this on the first or second line:

# -*- coding: utf-8 -*-

Furthermore, try adding 'ur' before the string so that it's both raw and Unicode:

state = re.search(ur'td>([^<]+)</td',s)
res = state.group(1)

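I've also edited your regex to make it match. Three dots mean "exactly three characters", which is too short for this word. In addition, if s is a UTF-8 byte string, each dot matches a single byte rather than a whole character, because UTF-8 is a multi-byte encoding, so the pattern may not work as expected.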
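
To make the difference concrete, here is a small sketch comparing the two patterns on the string from the question. It assumes Python 3 (so plain r'' strings are already Unicode), and the names word, dots, found, n_chars and n_bytes are made up for illustration.

```python
import re

word = "عـــــــــــادي"
s = "</td><td>" + word + "</td><td> 40.00</td>"

# Each '.' matches exactly one character, and the word is longer than
# three characters, so the original pattern finds nothing.
dots = re.findall(r'td>...</td', s)

# [^<]+ takes everything up to the next '<', whatever its length.
found = re.search(r'td>([^<]+)</td', s)
res = found.group(1) if found else None

# Encoded as UTF-8, the word takes more bytes than it has characters,
# so a fixed number of dots is even less reliable on a byte string.
n_chars = len(word)
n_bytes = len(word.encode("utf-8"))

print(dots)              # prints: []
print(res == word)       # prints: True
print(n_bytes > n_chars) # prints: True
```

Guarding on found before calling group(1) avoids an AttributeError when the pattern does not match.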
