How do I process a regular expression having unicode in Python?
So, I have this string in Python: str = u'world-weather-online®_jkpahjicmehopmlkbenbkmckcedlcmhk', and I just want to extract the world-weather-online® part of it using a regular expression. What I did is first match = re.search(r'([a-zA-Z0-9\-\%\+]+?)_[a-z]+', str) and then get the result in a string with str2 = match.group(1).
However, I end up with the error 'NoneType' object has no attribute 'group'. If I just try it with the string "world-weather-online_jkpahjicmehopmlkbenbkmckcedlcmhk", it works just fine, so the special Unicode symbol is what creates the problem. I tried using match = re.search(ur'([a-zA-Z0-9\-\%\+]+?)_[a-z]+', str) but it still doesn't help. Any ideas on how to solve this one? Thanks!
Use a Unicode regular expression and include the codepoint in your pattern:
match = re.search(ur'([a-zA-Z0-9®%+-]+?)_[a-z]+', yourstr)
You may want to think about what other codepoints should be included, apart from the registered sign ® (U+00AE).
Demo:
>>> import re
>>> yourstr = u'world-weather-online®_jkpahjicmehopmlkbenbkmckcedlcmhk'
>>> print re.search(ur'([a-zA-Z0-9®%+-]+?)_[a-z]+', yourstr).group(1)
world-weather-online®
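The demo above is Python 2. In Python 3 every str is already Unicode and the ur'' prefix is no longer accepted, so a plain raw string is enough. A minimal sketch of the same idea under that assumption (the helper name extract_name is made up for illustration and is not part of the original answer):

```python
import re

# Same character class as in the answer, with the registered sign added.
# A plain r'' literal is enough in Python 3 because str is Unicode.
PATTERN = re.compile(r'([a-zA-Z0-9®%+-]+?)_[a-z]+')

def extract_name(text):
    """Return the part before '_<lowercase id>', or None if nothing matches."""
    match = PATTERN.search(text)
    # re.search returns None when the pattern does not match; checking for
    # that avoids "'NoneType' object has no attribute 'group'".
    return match.group(1) if match else None

print(extract_name('world-weather-online®_jkpahjicmehopmlkbenbkmckcedlcmhk'))
print(extract_name('no underscore here'))
```

Checking the match object before calling group() also turns the error from the question into an ordinary None result that the caller can handle.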
Well, I think that you only forgot the ® in your regexp:
>>> match = re.search(r'([a-zA-Z0-9\-\%\+®+]+?)_[a-z]+', str)
>>> match.group(1)
u'world-weather-online\xae'
But if your string contains more Unicode characters, your regexp can get long. So just re.search(r'(.*)_[a-z]+', str) can do the trick.
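One thing to keep in mind with (.*) is that it is greedy, so if the name itself can contain underscores it captures everything up to the last underscore that is followed by lowercase letters. A negated class such as [^_]+ stops at the first underscore instead. A rough Python 3 sketch of the difference (the function names and the second sample string are invented for illustration):

```python
import re

def prefix_greedy(text):
    # (.*) grabs as much as possible, then backtracks to the last
    # underscore that is followed by lowercase letters.
    match = re.search(r'(.*)_[a-z]+', text)
    return match.group(1) if match else None

def prefix_up_to_first_underscore(text):
    # [^_]+ accepts any character except '_', so no Unicode symbol
    # has to be listed explicitly, and it stops at the first '_'.
    match = re.search(r'([^_]+)_[a-z]+', text)
    return match.group(1) if match else None

for sample in ('world-weather-online®_jkpahjic', 'my_app®_jkpahjic'):
    print(sample, prefix_greedy(sample), prefix_up_to_first_underscore(sample))
```

For the string in the question both variants give the same result. They only differ when there is more than one underscore, so which one fits depends on what the names can look like.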
And if you just want to split on the '_':
>>> str.split('_')[0]
u'world-weather-online\xae'
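Note that split('_')[0] returns the text before the first underscore. If the trailing identifier is always the last underscore-separated field, splitting from the right may be the safer choice. A small Python 3 sketch, assuming that layout (split_name is a hypothetical helper, and the variable is called name so that the built-in str is not shadowed):

```python
def split_name(text):
    """Split at the last '_' and return (prefix, identifier)."""
    head, sep, tail = text.rpartition('_')
    if not sep:
        # No underscore at all: treat the whole string as the prefix.
        return text, ''
    return head, tail

name = 'world-weather-online®_jkpahjicmehopmlkbenbkmckcedlcmhk'
prefix, identifier = split_name(name)
print(prefix)
print(identifier)
```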