How can I use lxml to print these HTML tags as text?

Question

I have a webpage that contains a large list of links. They are all contained inside <li> tags.

The <li> tags are inside an <ol> tag, which is inside a <div>, and so on, like this:

html --> body --> table --> tbody --> tr --> td --> table --> tbody --> tr --> td --> div --> ol

The <li> tags are then inside the <ol>.

How can I use lxml in Python to print the HTML of the <li> tags as text?

Answer 1

Using BeautifulSoup (which can use lxml as its underlying parser):

import bs4

text = """<html>
 <body>
  <table>
   <tbody>
    <tr>
     <td>
      <table>
       <tbody>
        <tr>
         <td>
          <div>
           <ol>
            <li>
             <a href="test.html" title="test title">Link Text</a>
             <a href="test2.html" title="test title 2">Link2 Text</a>
            </li>
           </ol>
          </div>
         </td>
        </tr>
       </tbody>
      </table>
     </td>
    </tr>
   </tbody>
  </table>
 </body>
</html>"""

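# Name the parser explicitly; "lxml" requires the lxml package to be installed.
soup = bs4.BeautifulSoup(text, "lxml")

# Follow the nesting described in the question down to the <li> elements.
listitems = soup.select("table > tbody > tr > td > table > tbody > tr > td > div > ol > li")
# Keep only the child tags of the first <li>, skipping whitespace text nodes.
tags = [tag for tag in listitems[0] if isinstance(tag, bs4.element.Tag)]
for tag in tags:
    print(tag)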

# OUTPUT
# <a href="test.html" title="test title">Link Text</a>
# <a href="test2.html" title="test title 2">Link2 Text</a>
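
If the goal is the markup of each <li> itself rather than its child tags, a minimal sketch along the same lines might look like the following. The sample markup and the helper names `outer_html` and `inner_html` are made up for illustration; `"html.parser"` is the standard-library parser, chosen here only so the snippet does not depend on lxml being installed.

```python
import bs4

# Hypothetical, trimmed-down sample; the real page nests the <ol> much deeper.
SAMPLE = """<div><ol>
<li><a href="test.html" title="test title">Link Text</a></li>
<li><a href="test2.html" title="test title 2">Link2 Text</a></li>
</ol></div>"""


def outer_html(markup):
    """Return each <li> as a string, including the <li> tags themselves."""
    soup = bs4.BeautifulSoup(markup, "html.parser")
    return [str(li) for li in soup.select("div > ol > li")]


def inner_html(markup):
    """Return only the markup found inside each <li>."""
    soup = bs4.BeautifulSoup(markup, "html.parser")
    return [li.decode_contents() for li in soup.select("div > ol > li")]


if __name__ == "__main__":
    for html in outer_html(SAMPLE):
        print(html)
```

`str(li)` serializes the element with its own tags, while `decode_contents()` leaves them out, so which one to use depends on whether the surrounding <li> is wanted in the output.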

Answer 2

The solution below should do it in lxml. However, BeautifulSoup will probably be a better choice, since it handles malformed HTML much better.

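import lxml.etree as etree

# Use the HTML parser: the default XML parser rejects most real-world HTML.
tree = etree.parse("test.html", etree.HTMLParser())
for li in tree.iterfind(".//td/div/ol/li"):
    # Serialize every child tag of the <li> as text (str rather than bytes).
    for child in li:
        print(etree.tostring(child, encoding="unicode", with_tail=False))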

I'll edit with a BeautifulSoup answer in a minute. EDIT: See Adam's solution.

2 answers

Answer 1
1  Accepted  2014-03-25 20:43:11

Answer 2
1  2014-03-25 20:43:46
