Matching Unicode characters in a Python regular expression

Question

I have read the other questions on Stack Overflow but am still no closer. Sorry if this has already been answered, but none of the suggestions there worked for me.

>>> import re
>>> m = re.match(r'^/by_tag/(?P<tag>\w+)/(?P<filename>(\w|[.,!#%{}()@])+)$', '/by_tag/xmas/xmas1.jpg')
>>> print m.groupdict()
{'tag': 'xmas', 'filename': 'xmas1.jpg'}

All is well. Then I tried something with Norwegian characters in it (or, more generally, something Unicode-like):

>>> m = re.match(r'^/by_tag/(?P<tag>\w+)/(?P<filename>(\w|[.,!#%{}()@])+)$', '/by_tag/påske/øyfjell.jpg')
>>> print m.groupdict()
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'groupdict'

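How can I match typical Unicode characters such as øæå? I would like to be able to match these characters in both the tag group and the filename group above.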

Answer 1

You need to specify the re.UNICODE flag, and enter your string as a Unicode string by using the u prefix:

>>> re.match(r'^/by_tag/(?P<tag>\w+)/(?P<filename>(\w|[.,!#%{}()@])+)$', u'/by_tag/påske/øyfjell.jpg', re.UNICODE).groupdict()
{'tag': u'p\xe5ske', 'filename': u'\xf8yfjell.jpg'}

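This is in Python 2. In Python 3 the u prefix is unnecessary because all strings are Unicode (Python 3.0 to 3.2 reject it outright; 3.3 and later accept it again as a no-op), and you can omit the re.UNICODE flag, since Unicode matching is already the default for str patterns.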
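
To make the Python 3 behaviour concrete, here is a minimal sketch that reuses the pattern from the question. It assumes Python 3 and precomposed characters (å and ø as single code points); the names PATTERN, result, ascii_pattern and ascii_result are illustrative only.

```python
import re

# Same pattern as in the question. On a str pattern in Python 3,
# \w already matches Unicode word characters, so no flag is passed.
PATTERN = re.compile(r'^/by_tag/(?P<tag>\w+)/(?P<filename>(\w|[.,!#%{}()@])+)$')

m = PATTERN.match('/by_tag/påske/øyfjell.jpg')
result = m.groupdict()
print(result)

# For contrast: with re.ASCII, \w is limited to [a-zA-Z0-9_],
# so the same path no longer matches and match() returns None.
ascii_pattern = re.compile(PATTERN.pattern, re.ASCII)
ascii_result = ascii_pattern.match('/by_tag/påske/øyfjell.jpg')
print(ascii_result)
```

The re.ASCII variant reproduces, in Python 3, the failure the question shows in Python 2 without re.UNICODE.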

Answer 2

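You need the UNICODE flag. In Python 2 the subject string should also be a Unicode string (note the u prefix); otherwise \w is applied to the individual bytes of the encoded string:

m = re.match(r'^/by_tag/(?P<tag>\w+)/(?P<filename>(\w|[.,!#%{}()@])+)$', u'/by_tag/påske/øyfjell.jpg', re.UNICODE)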
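
The AttributeError in the question comes from calling groupdict() on None, which is what re.match returns when nothing matches. A small sketch of a guard around the match, assuming Python 3; TAG_PATH and parse_tag_path are hypothetical names, not part of any answer here.

```python
import re

TAG_PATH = re.compile(
    r'^/by_tag/(?P<tag>\w+)/(?P<filename>(\w|[.,!#%{}()@])+)$',
    re.UNICODE,  # redundant for str patterns in Python 3, kept to mirror the answer
)

def parse_tag_path(path):
    """Return the named groups as a dict, or None if the path does not match."""
    m = TAG_PATH.match(path)
    return m.groupdict() if m else None

print(parse_tag_path('/by_tag/xmas/xmas1.jpg'))
print(parse_tag_path('/by_tag/påske/øyfjell.jpg'))
print(parse_tag_path('/not_a_tag_path'))
```

Compiling the pattern once and checking the match object before using it lets a failed match be handled by the caller instead of raising.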

Answer 3

>>> re.sub(r"[\w]+","___",unicode(",./hello-=+","utf-8"),flags=re.UNICODE)
u',./___-=+'
>>> re.sub(r"[\w]+","___",unicode(",./cześć-=+","utf-8"),flags=re.UNICODE)
u',./___-=+'
>>> re.sub(r"[\w]+","___",unicode(",./привет-=+","utf-8"),flags=re.UNICODE)
u',./___-=+'
>>> re.sub(r"[\w]+","___",unicode(",./你好-=+","utf-8"),flags=re.UNICODE)
u',./___-=+'
>>> re.sub(r"[\w]+","___",unicode(",./你好，世界-=+","utf-8"),flags=re.UNICODE)
u',./___\uff0c___-=+'
>>> print re.sub(r"[\w]+","___",unicode(",./你好，世界-=+","utf-8"),flags=re.UNICODE)
,./___，___-=+
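
For readers on Python 3, the session above might be sketched as follows. This assumes Python 3, where unicode() no longer exists and str literals are already Unicode; the names samples and results are illustrative.

```python
import re

samples = [
    ",./hello-=+",
    ",./cześć-=+",
    ",./привет-=+",
    ",./你好-=+",
    ",./你好，世界-=+",
]

# No flags needed: \w matches Unicode word characters for str patterns.
results = [re.sub(r"\w+", "___", s) for s in samples]

for original, replaced in zip(samples, results):
    print(original, "->", replaced)
```

As in the Python 2 output, the fullwidth comma in the last sample is punctuation rather than a word character, so it splits the text into two separate replacements.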

Answer 1: score 49, accepted, 2011-02-17 12:18:18

Answer 2: score 13, 2011-02-17 12:12:47

Answer 3: score 6, 2012-10-25 05:46:29