How can I print these HTML tags as text with lxml?
I have a webpage that contains a large list of links. They are all contained inside <li> tags. The <li> tags are inside an <ol> tag inside a <div>, and so on, like this:
html --> body --> table --> tbody --> tr --> td --> table --> tbody --> tr --> td --> div --> ol
The <li> tags then sit directly inside the <ol>.
How can I use lxml in Python to print the <li> tags' HTML as text?
Using BeautifulSoup (which can use lxml as its underlying parser):
import bs4
text = """<html>
<body>
<table>
<tbody>
<tr>
<td>
<table>
<tbody>
<tr>
<td>
<div>
<ol>
<li>
<a href="test.html" title="test title">Link Text</a>
<a href="test2.html" title="test title 2">Link2 Text</a>
</li>
</ol>
</div>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
</body>
</html>"""
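# Name the parser explicitly so bs4 does not have to guess one
soup = bs4.BeautifulSoup(text, "html.parser")
listitems = soup.select("table > tbody > tr > td > table > tbody > tr > td > div > ol > li")
# Keep only child elements of the first <li>, skipping whitespace text nodes
tags = [tag for tag in listitems[0] if isinstance(tag, bs4.element.Tag)]
for tag in tags:
    print(tag)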
# OUTPUT
# <a href="test.html" title="test title">Link Text</a>
# <a href="test2.html" title="test title 2">Link2 Text</a>
The solution below should do it in lxml. However, BeautifulSoup will probably be a better choice, since it handles malformed HTML much better.
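import lxml.etree as etree

# Use the HTML parser; the default XML parser rejects most real-world HTML
tree = etree.parse("test.html", etree.HTMLParser())
for li in tree.iterfind(".//td/div/ol/li"):
    # li[0] is the first child element; pass li itself to serialize the whole item
    print(etree.tostring(li[0], encoding="unicode"))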
I'll edit with a BeautifulSoup answer in a minute. EDIT: See Adam's solution.