
How do I process a regular expression having unicode in Python?

So, I have this string str = u'world-weather-online®_jkpahjicmehopmlkbenbkmckcedlcmhk' in Python and I just want to extract the world-weather-online® part of it using a regular expression. What I did is first match = re.search(r'([a-zA-Z0-9\-\%\+]+?)_[a-z]+', str) and then get the result in a string str2 = match.group(1).

However, I end up with the error 'NoneType' object has no attribute 'group'. If I just try it with the string "world-weather-online_jkpahjicmehopmlkbenbkmckcedlcmhk", it works just fine, so the special unicode symbol is what creates the problem. I tried using match = re.search(ur'([a-zA-Z0-9\-\%\+]+?)_[a-z]+', str) but it still doesn't help. Any ideas on how to solve this one? Thanks!

Use a Unicode regular expression and include the codepoint in your pattern:

match = re.search(ur'([a-zA-Z0-9®%+-]+?)_[a-z]+', yourstr)

You may want to think about what other codepoints should be included, apart from the registered-trademark sign ® (U+00AE).

Demo:

>>> import re
>>> yourstr = u'world-weather-online®_jkpahjicmehopmlkbenbkmckcedlcmhk'
>>> print re.search(ur'([a-zA-Z0-9®%+-]+?)_[a-z]+', yourstr).group(1)
world-weather-online®
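
The demo above is Python 2 (the ur'' prefix and the print statement). As a rough sketch for readers on Python 3, where string literals are already Unicode and the ur prefix is a syntax error, the same idea might look like the following. The variable names text, explicit and anything are made up for illustration, and the second pattern is an assumption about what "other codepoints" could mean here: it simply accepts anything that is not an underscore.

```python
import re

# Python 3: plain string literals are Unicode, so no u/ur prefix is needed.
text = 'world-weather-online®_jkpahjicmehopmlkbenbkmckcedlcmhk'

# Option 1: list the extra code point explicitly (\u00ae is the ® sign).
explicit = re.search(r'([a-zA-Z0-9\u00ae%+-]+?)_[a-z]+', text)

# Option 2 (assumption): allow any character except the underscore separator,
# so other non-ASCII symbols do not need to be enumerated one by one.
anything = re.search(r'([^_]+)_[a-z]+', text)

print(explicit.group(1))  # world-weather-online®
print(anything.group(1))  # world-weather-online®
```

Option 2 is only appropriate if the part you want can never contain an underscore itself.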

Well, I think that you only forgot the ® in your regexp:

>>> match = re.search(r'([a-zA-Z0-9\-\%\+®+]+?)_[a-z]+', str)
>>> match.group(1)
u'world-weather-online\xae'

But if your string contains more unicode characters, your regexp can get long… So just re.search(r'(.*)_[a-z]+', str) can do the trick.
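
One thing to keep in mind with (.*) is that it is greedy, so with several underscores it captures up to the last one that is followed by lowercase letters. A small sketch in Python 3 syntax (extract_prefix is a hypothetical helper, not something from the question) that also guards against the None result behind the original error:

```python
import re

def extract_prefix(text, pattern=r'(.*)_[a-z]+'):
    """Return the captured prefix, or None when the pattern does not match."""
    match = re.search(pattern, text)
    # re.search returns None on failure; calling .group on None raises the
    # "'NoneType' object has no attribute 'group'" error from the question.
    return match.group(1) if match else None

sample = 'world-weather-online®_jkpahjicmehopmlkbenbkmckcedlcmhk'
greedy = extract_prefix(sample)                         # 'world-weather-online®'
greedy_multi = extract_prefix('a_b_cde')                # 'a_b' (last underscore wins)
lazy_multi = extract_prefix('a_b_cde', r'(.*?)_[a-z]+') # 'a'   (first underscore wins)
no_match = extract_prefix('no separator here')          # None

print(greedy, greedy_multi, lazy_multi, no_match)
```

Whether the greedy or the lazy form is right depends on whether the name in front of the separator may itself contain underscores.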

And if you just want to split on the '_':

>>> str.split('_')[0]
u'world-weather-online\xae'
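
If no regular expression is needed at all, str.partition and str.rpartition are a possible alternative to split. A minimal sketch in Python 3 syntax; the variable is called text rather than str so that the built-in type is not shadowed, and the second input string is invented for illustration:

```python
text = 'world-weather-online®_jkpahjicmehopmlkbenbkmckcedlcmhk'

# partition splits at the first underscore and always returns three parts.
head, sep, tail = text.partition('_')
print(head)  # world-weather-online®

# rpartition splits at the last underscore, which helps when the name
# in front of the separator may itself contain underscores.
name, _, suffix = 'my_extension_abcdef'.rpartition('_')
print(name)  # my_extension

# If there is no underscore, partition returns the whole string and an empty sep.
whole, missing_sep, rest = 'nounderscore'.partition('_')
```

Checking sep (or missing_sep here) is a simple way to tell whether the separator was actually present.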
