Extract all tables and h4 elements from HTML

Question

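I have an HTML file and I want to extract all the tables and h4 elements from it. That is, I only want to take the tables and h4 elements out of the file and use them somewhere else. I am using Notepad++ and am looking for a Python script to do this.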

<html>
// header
<body>
  <div>
  <h4></h4>
  <h4></h4>
  <table>
    // some rows with cells here
    </table>
  // maybe some content here
  <table>
    // a form and other stuff
  </table>
  // probably some more text
 </div>
</body>
</html>

Thanks.

Answer 1

I suggest using the BeautifulSoup module.

You can do what you need like this:

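    from bs4 import BeautifulSoup

    # Read the whole file; the utf-8 encoding is an assumption about your file.
    with open("file.html", encoding="utf-8") as f:
        html = f.read()

    # "html.parser" is the parser that ships with Python.
    soup = BeautifulSoup(html, "html.parser")
    h4_tags = soup.find_all("h4")
    table_tags = soup.find_all("table")

    # .text gives the text content; use str(tag) to keep the markup instead.
    for h in h4_tags:
        print(h.text)
    for table in table_tags:
        print(table.text)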

Answer 2

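Since BeautifulSoup has already been mentioned, I just want to point to the tools in the standard library.

You can use the built-in HTML parser or regular expressions (see the tutorial).

Sometimes these tools are enough. It depends on the task.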
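As a rough illustration of the built-in parser route, here is a minimal sketch built on `html.parser.HTMLParser` from the standard library. The class name `BlockExtractor` and its attributes are made up for this example. It assumes reasonably well-formed markup, drops HTML comments, and re-escapes text, so the output is equivalent to the input rather than byte-identical.

```python
from html import escape
from html.parser import HTMLParser


class BlockExtractor(HTMLParser):
    """Collect the markup of every <h4> and <table> element (hypothetical helper)."""

    WANTED = {"h4", "table"}

    def __init__(self):
        super().__init__()   # convert_charrefs=True: text arrives already unescaped
        self.blocks = []     # finished blocks, in document order
        self._stack = []     # wanted tags that are currently open (handles nesting)
        self._buffer = []    # pieces of the block being collected

    def handle_starttag(self, tag, attrs):
        if tag in self.WANTED:
            self._stack.append(tag)
        if self._stack:
            # Raw text of the start tag, attributes included.
            self._buffer.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied once, without a close tag.
        if self._stack:
            self._buffer.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if not self._stack:
            return
        self._buffer.append(f"</{tag}>")
        if tag == self._stack[-1]:
            self._stack.pop()
            if not self._stack:
                # The outermost wanted element just closed: store the block.
                self.blocks.append("".join(self._buffer))
                self._buffer = []

    def handle_data(self, data):
        if self._stack:
            # Re-escape &, < and > so the collected markup stays valid.
            self._buffer.append(escape(data, quote=False))


sample = "<div><h4>Title</h4><p>skip me</p><table><tr><td>cell</td></tr></table></div>"
parser = BlockExtractor()
parser.feed(sample)
parser.close()
for block in parser.blocks:
    print(block)
```

A nested table stays inside its outer table's block instead of being reported separately, which is usually what you want when copying the markup elsewhere.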

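By the way: Notepad++ supports regular expressions. <h4.*?/h4> or <table.*?/table> lets you select those blocks (enable the ". matches newline" option so that a table spanning several lines is matched).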
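The same idea can be sketched in Python with the `re` module. The function name `extract_blocks` is made up for this example, and the pattern is a slightly stricter variant of the ones above, not the exact Notepad++ expression. Because the match is non-greedy, a table nested inside another table would be cut off at the first closing tag, so treat this as a quick tool for simple pages only.

```python
import re

# Match from an opening <h4 ...> or <table ...> tag to the first matching close tag.
# re.DOTALL lets "." cross line breaks; re.IGNORECASE also accepts <TABLE>, <H4>.
BLOCK_RE = re.compile(r"<(h4|table)\b[^>]*>.*?</\1\s*>", re.DOTALL | re.IGNORECASE)


def extract_blocks(html):
    """Return the markup of every h4 and table block, in document order."""
    return [match.group(0) for match in BLOCK_RE.finditer(html)]


sample = """<div>
<h4>One</h4>
<table>
<tr><td>x</td></tr>
</table>
some text
</div>"""

for block in extract_blocks(sample):
    print(block)
```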

Answer 3

The established go-to library for parsing and editing HTML with Python is called BeautifulSoup.

Solution 1
2, accepted, 2014-02-07 14:51:13

Solution 2
2, 2014-02-07 14:59:04

Solution 3
1, 2014-02-07 14:46:43