import re
html="""<div class="tB-mb">
<span class="t-d">0</span>
<span class="t-d">0</span> 天
<span class="t-h">0</span>
<span class="t-h">0</span> 时
<span class="t-m">0</span>
<span class="t-m">0</span> 分
<span class="t-s">0</span>
<span class="t-s">0</span> 秒
"""
tmp=re.compile(u"(<div class='tB-mb'>).*?([\u4e00-\u9fa5]).*?",re.U)
result=re.findall(tmp,html.decode("utf-8"))
print result
[]
As shown above, the result is an empty list. Why can't my code match the Chinese characters?
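You're using single quotes in <div class='tB-mb'> in your regex pattern, whereas the html string has the div's class in double quotes, so the pattern never matches. Even with the quotes fixed, . does not match newlines unless re.S (re.DOTALL) is set, so .*? could not reach the characters on the following lines. I think there's a simpler pattern which can extract what you want: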
tmp = re.compile(u"[\u4e00-\u9fa5]+", re.U)  # one or more consecutive Chinese characters
result=re.findall(tmp,html)
print result
Output: ['天', '时', '分', '秒']
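For reference, here is a minimal Python 3 sketch (the code above is Python 2) of why the original pattern finds nothing. The shortened html sample and the names original, fixed_quotes, and with_dotall are illustrative assumptions, not part of the question.

```python
import re

# Shortened stand-in for the html in the question (assumed sample).
html = '<div class="tB-mb">\n<span class="t-d">0</span> 天\n<span class="t-h">0</span> 时\n'

# Original pattern: single quotes never match the double quotes in html.
original = re.compile("(<div class='tB-mb'>).*?([\u4e00-\u9fa5]).*?")

# Quotes fixed, but "." still cannot cross the newline after the div tag.
fixed_quotes = re.compile('(<div class="tB-mb">).*?([\u4e00-\u9fa5])')

# re.S (DOTALL) lets "." match newlines, so the first character is reached.
with_dotall = re.compile('(<div class="tB-mb">).*?([\u4e00-\u9fa5])', re.S)

print(original.findall(html))      # []
print(fixed_quotes.findall(html))  # []
print(with_dotall.findall(html))   # [('<div class="tB-mb">', '天')]
```

Even the working variant only captures the first character after the div, which is why a pattern that targets the Chinese characters directly is simpler.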
If your html is bigger than what is shown in the question, and you want only the Chinese characters in <div class="tB-mb">, you can first extract the text within the div and then search inside that text:
# non-greedy, so the match stops at the first closing </div>
inside_text = re.search(r'<div class="tB-mb">[\s\S]+?</div>', html).group()
result = re.findall(tmp, inside_text)
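The output is the same list, restricted to the contents of that div. Note that the html shown in the question has no closing </div>, so re.search would return None for it as posted.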
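A self-contained Python 3 sketch of this two-step approach, assuming the html includes the closing </div>. The sample string and the helper name chinese_in_div are hypothetical and only serve the illustration.

```python
import re

# Assumed sample: the target div plus unrelated Chinese text outside it.
html = """<p>其他</p>
<div class="tB-mb">
<span class="t-d">0</span> 天
<span class="t-h">0</span> 时
<span class="t-m">0</span> 分
<span class="t-s">0</span> 秒
</div>
<p>结束</p>"""

CJK = re.compile("[\u4e00-\u9fa5]+")
# Non-greedy so the match stops at the first closing </div>.
DIV = re.compile(r'<div class="tB-mb">[\s\S]+?</div>')


def chinese_in_div(text):
    """Return runs of Chinese characters inside the tB-mb div, or [] if absent."""
    match = DIV.search(text)
    if match is None:  # div not found, or closing tag missing
        return []
    return CJK.findall(match.group())


print(chinese_in_div(html))  # ['天', '时', '分', '秒']
```

The None check avoids an AttributeError when the div is missing. A regex like this will likely stop too early if the div contains nested divs; an HTML parser would probably be more robust in that case.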