Why can't a regex match Chinese characters in Python?

Question

 import re
 html="""<div class="tB-mb">
                   <span class="t-d">0</span> 
                   <span class="t-d">0</span> 天 
                   <span class="t-h">0</span>
                   <span class="t-h">0</span> 时
                   <span class="t-m">0</span>
                   <span class="t-m">0</span> 分 
                   <span class="t-s">0</span>
                   <span class="t-s">0</span> 秒
     """
 tmp=re.compile(u"(<div class='tB-mb'>).*?([\u4e00-\u9fa5]).*?",re.U)
 result=re.findall(tmp,html.decode("utf-8"))
 print result
 []

As shown above, why doesn't my code match the Chinese characters?

Answer 1

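You wrote <div class='tB-mb'> with single quotes in the regular expression, while the HTML uses double quotes in the div's class attribute, so the pattern never matches. Note also that . does not match a newline unless re.S (re.DOTALL) is set, so .*? cannot span the lines of this HTML. I think a simpler pattern can extract what you want: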

# Matches runs of Chinese characters. On Python 2, pass html.decode("utf-8") so the text is Unicode.
tmp = re.compile(u"[\u4e00-\u9fa5]+", re.U)
result = re.findall(tmp, html)
print(result)

Output: ['天', '时', '分', '秒']
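
To see the two problems with the original pattern in isolation, here is a minimal sketch in Python 3 (where str is already Unicode, so no decode step is needed). The shortened html string and the variable names are assumptions made for illustration, not the asker's exact data.

```python
import re

# Shortened stand-in for the HTML in the question (an assumption for illustration).
html = '''<div class="tB-mb">
    <span class="t-d">0</span> 天
    <span class="t-h">0</span> 时
</div>'''

# Original pattern: single quotes around tB-mb never match the double-quoted attribute.
original = re.compile("(<div class='tB-mb'>).*?([\u4e00-\u9fa5]).*?")
print(original.findall(html))

# Quotes fixed, but "." still stops at newlines, so ".*?" cannot reach the next line.
quotes_fixed = re.compile('(<div class="tB-mb">).*?([\u4e00-\u9fa5])')
print(quotes_fixed.findall(html))

# Quotes fixed and re.DOTALL set: "." now crosses newlines and the first character is found.
both_fixed = re.compile('(<div class="tB-mb">).*?([\u4e00-\u9fa5])', re.DOTALL)
print(both_fixed.findall(html))
```

Only the last pattern finds anything, and because the div tag occurs once it yields a single (tag, character) pair. That is why the simpler character-class pattern is a better fit when all four characters are wanted.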

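If your HTML is larger than what is shown in the question and you only need the Chinese characters inside <div class="tB-mb">, you can first extract the text of that div and then search within it: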

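# Lazy +? stops at the first closing </div> instead of the last one on the page.
# Assumes the div is closed; re.search returns None otherwise.
inside_text = re.search(r'<div class="tB-mb">[\s\S]+?</div>', html).group()
result = re.findall(tmp, inside_text)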

The result is the desired list of characters.
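
The two-step approach can be sketched as follows in Python 3. The page string, the han pattern name, and the han_in_div helper are hypothetical, introduced only to show the difference between searching the whole page and searching inside the div; a real page with nested divs would be better served by an HTML parser.

```python
import re

# Hypothetical larger page: Chinese text appears both inside and outside the target div.
page = '''<p>倒计时</p>
<div class="tB-mb">
    <span class="t-d">0</span> 天
    <span class="t-s">0</span> 秒
</div>
<div class="other">结束</div>'''

# Runs of characters in the range used throughout this thread.
han = re.compile("[\u4e00-\u9fa5]+")

def han_in_div(text):
    """Return runs of Chinese characters found inside the first tB-mb div."""
    # Lazy [\s\S]*? stops at the first </div>; nested divs would need an HTML parser.
    block = re.search(r'<div class="tB-mb">([\s\S]*?)</div>', text)
    if block is None:
        return []  # no such div, or it is never closed
    return han.findall(block.group(1))

print(han_in_div(page))   # characters inside the div only
print(han.findall(page))  # characters from the whole page, for comparison
```

Checking for None before calling group avoids an AttributeError when the div is missing or unclosed.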

Solution 1
2 Accepted 2017-07-09 07:01:11
