When I use lxml.etree to parse HTML, if the target tag contains multiple tags, I cannot parse all the text using XPath. For example:
content = """
<h3 id="author">
<span>
<a target="_blank">zhang</a>
</span>
<span>
<a target="_blank">wang</a>
</span>
<p class="email">1234567@qq.com</p>
<span>
<a target="_blank">li</a>
</span>
<span>
<a target="_blank">lin</a>
</span>
</h3>
"""
from lxml import etree
html_tree = etree.HTML(content)
print(html_tree.xpath('//h3[@id="author"]//text()'))
The result is:
['\n ',
'\n ',
'zhang',
'\n ',
'\n ',
'\n ',
'wang',
'\n ',
'\n ']
I can't get the text "Li" and "Lin", But when I delete the P tag, I can get all the text. For example:
content = """
<h3 id="author">
<span>
<a target="_blank">zhang</a>
</span>
<span>
<a target="_blank">wang</a>
</span>
<span>
<a target="_blank">li</a>
</span>
<span>
<a target="_blank">lin</a>
</span>
</h3>
"""
from lxml import etree
html_tree = etree.HTML(content)
print(html_tree.xpath('//h3[@id="author"]//text()'))
The result is:
['\n ',
'\n ',
'zhang',
'\n ',
'\n ',
'\n ',
'wang',
'\n ',
'\n ',
'\n ',
'li',
'\n ',
'\n ',
'\n ',
'lin',
'\n ',
'\n ']
If you parse from string, you will get the correct response:
from lxml import etree
content = """
<h3 id="author">
<span>
<a target="_blank">zhang</a>
</span>
<span>
<a target="_blank">wang</a>
</span>
<p class="email">1234567@qq.com</p>
<span>
<a target="_blank">li</a>
</span>
<span>
<a target="_blank">lin</a>
</span>
</h3>
"""
root = etree.fromstring(content)
print(root.xpath('//h3[@id="author"]//text()'))
Result:
['\n ', '\n ', 'zhang', '\n ', '\n ', '\n ', 'wang', '\n ', '\n ', '1234567@qq.com', '\n ', '\n ', 'li', '\n ', '\n ', '\n ', 'lin', '\n ', '\n ']
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.