
Why can't I match Chinese characters with a regex in Python?

 import re
 html="""<div class="tB-mb">
                   <span class="t-d">0</span> 
                   <span class="t-d">0</span> 天 
                   <span class="t-h">0</span>
                   <span class="t-h">0</span> 时
                   <span class="t-m">0</span>
                   <span class="t-m">0</span> 分 
                   <span class="t-s">0</span>
                   <span class="t-s">0</span> 秒
     """
 tmp=re.compile(u"(<div class='tB-mb'>).*?([\u4e00-\u9fa5]).*?",re.U)
 result=re.findall(tmp,html.decode("utf-8"))
 print result
 []

As shown above, the result is an empty list. Why does my code fail to match the Chinese characters?

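You're using single quotes in <div class='tB-mb'> in your regex pattern, whereas the html has the div's class attribute in double quotes, so the literal prefix never matches. Note also that even with the quotes fixed, "." does not match newlines unless you pass re.S (re.DOTALL), so ".*?" could not get from the div tag to the first Chinese character on a later line. I think there's a simpler pattern which can extract what you want: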

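# Match runs of characters in the common CJK range; decode the byte string first (Python 2)
tmp = re.compile(u"[\u4e00-\u9fa5]+", re.U)
result = re.findall(tmp, html.decode("utf-8"))
print result

Output: [u'\u5929', u'\u65f6', u'\u5206', u'\u79d2'] (that is, 天, 时, 分, 秒; Python 2 prints unicode strings inside a list as escapes)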
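
For comparison, here is a minimal Python 3 sketch of the points above. It is an illustration under assumptions, not the asker's exact setup: the html sample is shortened, and the names broken, fixed and cjk are made up for this example. In Python 3, str is already Unicode, so no decode step is needed.

```python
import re

# Shortened version of the html from the question (no closing </div>, as in the original).
html = """<div class="tB-mb">
    <span class="t-d">0</span> 天
    <span class="t-h">0</span> 时
    <span class="t-m">0</span> 分
    <span class="t-s">0</span> 秒
"""

# The question's pattern: single quotes never match the double-quoted attribute.
broken = re.compile("(<div class='tB-mb'>).*?([\u4e00-\u9fa5]).*?")

# Same idea with double quotes, plus re.S so that "." can cross newlines.
fixed = re.compile('(<div class="tB-mb">).*?([\u4e00-\u9fa5])', re.S)

# Simpler approach: just collect runs of characters in the CJK range.
cjk = re.compile("[\u4e00-\u9fa5]+")

broken_result = broken.findall(html)
fixed_result = fixed.findall(html)
cjk_result = cjk.findall(html)

print(broken_result)  # []
print(fixed_result)   # [('<div class="tB-mb">', '天')]
print(cjk_result)     # ['天', '时', '分', '秒']
```

The fixed pattern finds only one tuple because the div prefix occurs once in the text, which is another reason the simpler character-class pattern is easier to work with here.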

If your html is bigger than what is shown in the question and you want only the Chinese characters inside <div class="tB-mb">, you can first extract the text of that div (this needs the closing </div> to be present) and then search inside that text:

# Non-greedy +? stops at the first </div> instead of the last one on the page
inside_text = re.search(r'<div class="tB-mb">[\s\S]+?</div>', html.decode("utf-8")).group()
result = re.findall(tmp, inside_text)

Output will be as desired.
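
As a rough Python 3 sketch of this two-step approach: the page string and the helper name chinese_in_div below are hypothetical, invented for illustration, and the page includes a closing </div> plus some unrelated Chinese text outside the div to show the effect of scoping.

```python
import re

# Hypothetical page: the countdown div plus unrelated Chinese text outside it.
page = """<p>促销</p>
<div class="tB-mb">
    <span class="t-d">0</span> 天
    <span class="t-h">0</span> 时
</div>
<p>结束</p>"""

CJK = re.compile("[\u4e00-\u9fa5]+")

def chinese_in_div(html, css_class):
    """Return runs of Chinese characters found inside the first matching div."""
    # Non-greedy +? stops at the first </div> after the opening tag.
    pattern = r'<div class="%s">[\s\S]+?</div>' % re.escape(css_class)
    block = re.search(pattern, html)
    if block is None:
        # No such div (or no closing tag): return nothing instead of raising.
        return []
    return CJK.findall(block.group())

inside = chinese_in_div(page, "tB-mb")
everywhere = CJK.findall(page)

print(inside)      # ['天', '时']
print(everywhere)  # ['促销', '天', '时', '结束']
```

Checking for None avoids the AttributeError that .group() would raise when the div is missing. This sketch assumes the target div contains no nested div; for nested or irregular markup, a real HTML parser is likely a safer choice than a regex.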


 