How to handle regular expressions with Unicode in Python?

Question

I have the string str = u'world-weather-online®_jkpahjicmehopmlkbenbkmckcedlcmhk' in Python and I just want to extract the world-weather-online® part of it using a regular expression. What I did is first match = re.search(r'([a-zA-Z0-9\-\%\+]+?)_[a-z]+', str) and then get the result in a string str2 = match.group(1).

However, I end up with the error 'NoneType' object has no attribute 'group'. If I try it with the string "world-weather-online_jkpahjicmehopmlkbenbkmckcedlcmhk", it works just fine, so the special Unicode symbol is what creates the problem. I tried using match = re.search(ur'([a-zA-Z0-9\-\%\+]+?)_[a-z]+', str) but it still doesn't help. Any ideas on how to solve this one? Thanks!

Answer 1

Use a Unicode regular expression and include the codepoint in your pattern:

match = re.search(ur'([a-zA-Z0-9®%+-]+?)_[a-z]+', yourstr)

You may want to think about what other codepoints should be included, apart from the registered trademark sign ® (U+00AE).

Demo:

>>> import re
>>> yourstr = u'world-weather-online®_jkpahjicmehopmlkbenbkmckcedlcmhk'
>>> print re.search(ur'([a-zA-Z0-9®%+-]+?)_[a-z]+', yourstr).group(1)
world-weather-online®
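
The demo above uses Python 2 syntax (the ur'' prefix and the print statement). On Python 3, where every str is already Unicode and the ur'' prefix no longer exists, the same idea might look like the sketch below. The helper name extract_name is hypothetical, and the None check is an addition that avoids the AttributeError described in the question.

```python
import re

# Same character class as in the answer, with ® (U+00AE) listed explicitly.
# Python 3 assumption: plain r'' patterns are Unicode, so no u prefix is needed.
NAME_RE = re.compile(r'([a-zA-Z0-9®%+-]+?)_[a-z]+')


def extract_name(value):
    """Return the part before '_<lowercase letters>', or None if nothing matches."""
    match = NAME_RE.search(value)
    # re.search returns None on failure; calling .group() on None raises
    # "'NoneType' object has no attribute 'group'".
    return match.group(1) if match else None


print(extract_name('world-weather-online®_jkpahjicmehopmlkbenbkmckcedlcmhk'))
print(extract_name('no underscore here'))
```

Checking the match before calling group() keeps a non-matching input from raising, which makes it easier to see which strings the character class does not cover yet.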

Answer 2

Well, I think that you only forgot the ® in your regexp:

>>> match = re.search(r'([a-zA-Z0-9\-\%\+®+]+?)_[a-z]+', str)
>>> match.group(1)
u'world-weather-online\xae'

But if your string contains more Unicode characters, your regexp can get long. So just re.search(r'(.*)_[a-z]+', str) can do the trick.
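
One caveat with the catch-all pattern: .* is greedy, so when the string holds more than one underscore the group runs up to the last '_' that is followed by lowercase letters. The sketch below (Python 3 syntax; the function names and the second sample string are made up for illustration) compares it with a negated class [^_]+, which stops at the first underscore.

```python
import re


def before_last_suffix(value):
    """Greedy '.*': capture up to the LAST '_' followed by lowercase letters."""
    match = re.search(r'(.*)_[a-z]+', value)
    return match.group(1) if match else None


def before_first_underscore(value):
    """Negated class: capture only the run of non-underscore characters."""
    match = re.search(r'([^_]+)_[a-z]+', value)
    return match.group(1) if match else None


samples = [
    'world-weather-online®_jkpahjicmehopmlkbenbkmckcedlcmhk',
    'my_app®_abcdef',  # hypothetical name that itself contains '_'
]

for s in samples:
    print(repr(before_last_suffix(s)), repr(before_first_underscore(s)))
```

For the string in the question both variants give the same result; they only differ once the name itself contains an underscore.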

And if you just want to split on the '_':

>>> str.split('_')[0]
u'world-weather-online\xae'
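
If the name part may itself contain underscores, splitting from the right is a possible variation on the same idea. This is only a sketch in Python 3 syntax; the helper names and the second sample string are invented for illustration.

```python
def name_before_first(value):
    """Everything before the first '_' (same result as value.split('_')[0])."""
    return value.split('_', 1)[0]


def name_before_last(value):
    """Everything before the last '_'; the whole string if there is no '_'."""
    return value.rsplit('_', 1)[0]


name = 'world-weather-online®_jkpahjicmehopmlkbenbkmckcedlcmhk'
print(name_before_first(name))
print(name_before_last(name))

tricky = 'my_app®_abcdef'  # hypothetical name containing an underscore
print(name_before_first(tricky))
print(name_before_last(tricky))
```

Neither helper validates the suffix, so unlike the regular expressions above they also accept strings whose trailing part is not made of lowercase letters.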

Answer 1
3, accepted, 2014-04-01 11:11:38

Answer 2
2, 2014-04-01 11:07:52