How do I use regular expressions with Unicode strings in Python?

Question

Hi, I want to use a regular expression on the following Unicode (UTF-8) string:

</td><td>عـــــــــــادي</td><td> 40.00</td>

I want to pick out "عـــــــــــادي". How can I do this?

My code for this is:

state = re.findall(r'td>...</td',s)

Thanks

Answer 1

I ran across something similar when trying to match a string in Russian. For your situation, Michele's answer works fine. If you want to use special sequences like \w and \s, though, you have to change a few things. I'm sharing this in the hope that it will be useful to someone else.

>>> string = u"</td><td>Я люблю мороженое</td><td> 40.00</td>"

Make the string Unicode by placing a u before the opening quotation mark.

>>> pattern = re.compile(ur'>([\w\s]+)<', re.UNICODE)

Set the re.UNICODE flag so that \w and \s follow the Unicode character properties instead of matching ASCII only (see the docs).

(Alternatively, you can set an explicit character range for your language. For Russian this would be [а-яА-Я], so:

pattern = re.compile(ur'>([а-яА-Я\s]+)<')

In that case you no longer need the flag, since the pattern does not rely on \w; the remaining \s still matches ordinary ASCII whitespace without it.)

>>> match = pattern.findall(string)
>>> for i in match:
...     print i
... 
Я люблю мороженое
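The session above is Python 2 (the ur prefix and the print statement do not exist in Python 3). As a rough sketch of the same idea, assuming Python 3, where str patterns are Unicode-aware by default: the names text, word_pattern and range_pattern are only illustrative. The range variant also adds ё/Ё, because those two letters lie outside the а-я and А-Я code point ranges.

```python
import re

text = "</td><td>Я люблю мороженое</td><td> 40.00</td>"

# Python 3: str patterns already treat \w and \s as Unicode-aware,
# so neither a u/ur prefix nor re.UNICODE is needed here.
word_pattern = re.compile(r">([\w\s]+)<")
words = word_pattern.findall(text)
print(words)

# Explicit Cyrillic range. ё and Ё are listed separately because they
# are not covered by а-я or А-Я.
range_pattern = re.compile(r">([а-яА-ЯёЁ\s]+)<")
ranged = range_pattern.findall(text)
print(ranged)
```

Both patterns skip " 40.00", since the full stop is neither a word character nor whitespace.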

Answer 2

According to PEP 0263: Defining Python Source Code Encodings, you first need to tell Python that the whole source file is UTF-8 encoded by adding a comment like this as the first line:

# -*- coding: utf-8 -*-

Furthermore, try adding ur before the string so that it is both raw and Unicode:

state = re.search(ur'td>([^<]+)</td',s)
res = state.group(1)

I've also edited your regex to make it match. Three dots mean "exactly three characters", which is shorter than the text you want to capture. In addition, if s is a UTF-8 encoded byte string, each dot matches a single byte rather than a whole character, so with a multi-byte encoding the pattern may not work as expected.
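For readers on Python 3, here is a minimal sketch of the same fix applied to the string from the question. It assumes s is already a decoded str; the short sample word used for the byte comparison is an illustrative assumption, not part of the question.

```python
import re

s = "</td><td>عـــــــــــادي</td><td> 40.00</td>"

# [^<]+ captures everything up to the next "<", whatever its length.
match = re.search(r"td>([^<]+)</td", s)
if match:
    print(match.group(1))

# Why a fixed number of dots is fragile: on a str each "." is one
# character, on bytes each "." is one byte.
word = "عادي"                      # hypothetical 4-letter sample
encoded = word.encode("utf-8")      # each of these letters takes 2 bytes
print(len(word), len(encoded))
print(bool(re.fullmatch(r"....", word)))      # four characters
print(bool(re.fullmatch(rb"....", encoded)))  # eight bytes, so no match
```

Decoding the input to str before matching sidesteps the byte-counting problem entirely.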
