Why can't I match Chinese characters using regex in Python?

Question

 import re
 html="""<div class="tB-mb">
                   <span class="t-d">0</span> 
                   <span class="t-d">0</span> 天 
                   <span class="t-h">0</span>
                   <span class="t-h">0</span> 时
                   <span class="t-m">0</span>
                   <span class="t-m">0</span> 分 
                   <span class="t-s">0</span>
                   <span class="t-s">0</span> 秒
     """
 tmp=re.compile(u"(<div class='tB-mb'>).*?([\u4e00-\u9fa5]).*?",re.U)
 result=re.findall(tmp,html.decode("utf-8"))
 print result
 []

As shown above, why can't my code match the Chinese characters?

Answer 1

You're using single quotes in <div class='tB-mb'> in your regex pattern, whereas the HTML has the div's class in double quotes, so the pattern never matches. I think there's a simpler pattern that extracts what you want:

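tmp = re.compile(u"[\u4e00-\u9fa5]+", re.U)  # one or more consecutive Chinese characters
result = re.findall(tmp, html.decode("utf-8"))  # decode the byte string first, as in the question
print result

Output: ['天', '时', '分', '秒'] (Python 2 displays the list elements as escapes such as u'\u5929')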
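
To make the failure easier to see, here is a minimal sketch, assuming Python 3 (where str is already Unicode, so no decode step is needed) and a shortened copy of the question's HTML. The names broken, quote_fixed, dotall and cjk are only illustrative. It suggests that fixing the quotes alone is not enough: without re.S the dot does not cross newlines, and even with re.S the original pattern yields only one match, because the div tag occurs once.

```python
import re

# Shortened copy of the HTML from the question.
html = """<div class="tB-mb">
    <span class="t-d">0</span> 天
    <span class="t-h">0</span> 时
    <span class="t-m">0</span> 分
    <span class="t-s">0</span> 秒
</div>"""

# 1. Original pattern: single quotes never match the double-quoted attribute.
broken = re.compile("(<div class='tB-mb'>).*?([\u4e00-\u9fa5]).*?")
print(broken.findall(html))       # []

# 2. Quotes fixed, but "." still cannot cross the newline after the tag.
quote_fixed = re.compile('(<div class="tB-mb">).*?([\u4e00-\u9fa5]).*?')
print(quote_fixed.findall(html))  # []

# 3. With re.S the pattern matches, but only once (one div tag in the text).
dotall = re.compile('(<div class="tB-mb">).*?([\u4e00-\u9fa5]).*?', re.S)
print(dotall.findall(html))       # [('<div class="tB-mb">', '天')]

# 4. Matching the characters directly returns every run.
cjk = re.compile("[\u4e00-\u9fa5]+")
print(cjk.findall(html))          # ['天', '时', '分', '秒']
```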

If your HTML is larger than what is shown in the question and you want only the Chinese characters inside <div class="tB-mb">, you can first extract the text within the div and then search inside that text (this assumes the div has a closing </div> tag, which the sample in the question lacks):

# Non-greedy match: stop at the first </div> instead of the last one in the page
inside_text = re.search(r'<div class="tB-mb">[\s\S]+?</div>', html.decode("utf-8")).group()
result = re.findall(tmp, inside_text)

Output will be as desired.
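
The two-step approach can be sketched as follows, again assuming Python 3. The sample page and the helper cjk_in_div are hypothetical and not part of the original answer. The helper returns an empty list when the div is not found, rather than failing on .group().

```python
import re

# Hypothetical larger page with Chinese text outside the target div.
page = """<p>倒计时</p>
<div class="tB-mb">
    <span class="t-d">0</span> 天
    <span class="t-h">0</span> 时
</div>
<div class="other">结束</div>"""

cjk = re.compile("[\u4e00-\u9fa5]+")

def cjk_in_div(html, css_class):
    """Return runs of Chinese characters inside the first div with this class."""
    # Non-greedy body so the match stops at the first closing </div>.
    pattern = r'<div class="%s">[\s\S]*?</div>' % re.escape(css_class)
    block = re.search(pattern, html)
    if block is None:
        return []  # no such div, or its closing tag is missing
    return cjk.findall(block.group())

print(cjk_in_div(page, "tB-mb"))    # ['天', '时']
print(cjk_in_div(page, "missing"))  # []
print(cjk.findall(page))            # ['倒计时', '天', '时', '结束']
```

This sketch does not handle nested divs; for markup like that, a real HTML parser would likely be more robust than a regex.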

1 answer

solution1
2 ACCEPTED 2017-07-09 07:01:11
