Python BeautifulSoup table scraping

Question

My HTML contains several tables. The first one is:

<table>
    <tr>
        <td>
            <div id="string">
            </div>
        </td>
    </tr>
</table>

The rest have this form:

<table class="confluenceTable" data-csvtable="1">
      <tbody>
          <tr>
             <th class="highlight-grey confluenceTh" data-highlight-colour="grey" rowspan="2" style="text-align: center;">Negev</th>

I want to scrape the data from these tables. When I use:

from bs4 import BeautifulSoup
from urllib.request import urlopen

url = 'XXX'
soup = BeautifulSoup(urlopen(url).read(), "lxml")
for table in soup.findAll('table'):
    print(table)

it finds only the first table. When I change the search to:

soup.findAll("table", { "class" : "confluenceTable" })

it finds nothing. What am I missing?

I am using Python 3.4 with BeautifulSoup 4.5 on Windows.

Answer 1

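I suspect you are trying to scrape an Atlassian Confluence page. Such pages are usually very dynamic and rely heavily on JavaScript to load their content. If you look at the HTML source downloaded with urllib, you will not find any table element with the confluenceTable class.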
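
One way to confirm this is to inspect the downloaded markup without BeautifulSoup or lxml involved, so the parser can be ruled out as the cause. The following is a minimal sketch using only the standard library; `TableClassCollector` and `table_classes` are hypothetical helper names, not part of any library.

```python
from html.parser import HTMLParser


class TableClassCollector(HTMLParser):
    """Record the class attribute of every <table> start tag."""

    def __init__(self):
        super().__init__()
        self.classes = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        if tag == "table":
            # A table without a class attribute is recorded as "".
            self.classes.append(dict(attrs).get("class") or "")


def table_classes(html):
    """Return the class string of each <table> found in html, in order."""
    collector = TableClassCollector()
    collector.feed(html)
    return collector.classes
```

Passing it the decoded result of `urlopen(url).read()` shows what the server actually sent. If the list holds a single empty string, the confluenceTable tables were never in the response, and no BeautifulSoup query can find them.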

Instead, you should consider using the Confluence API, or a browser automation tool such as selenium.
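
As a rough illustration of the API route, the sketch below requests a page's rendered body from the Confluence REST API and returns its HTML. It rests on several assumptions that are not in the original answer: that the instance exposes the `/rest/api/content/{id}` endpoint with `expand=body.view`, that basic authentication is accepted, and that you know the numeric page ID. The function names and the example host are hypothetical. Check your own instance's API documentation before relying on any of this.

```python
import base64
import json
from urllib.parse import quote, urlencode
from urllib.request import Request, urlopen


def content_url(base_url, page_id):
    """Build the request URL (assumed endpoint layout, see lead-in)."""
    query = urlencode({"expand": "body.view"})
    return "{}/rest/api/content/{}?{}".format(
        base_url.rstrip("/"), quote(str(page_id)), query)


def body_html(payload):
    """Pull the rendered HTML out of the decoded JSON response."""
    return payload["body"]["view"]["value"]


def fetch_page_html(base_url, page_id, user, password):
    """Download one page and return its rendered HTML body."""
    request = Request(content_url(base_url, page_id))
    # Basic auth is an assumption; some instances require API tokens.
    credentials = "{}:{}".format(user, password).encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    request.add_header("Authorization", "Basic " + token)
    with urlopen(request) as response:
        return body_html(json.loads(response.read().decode("utf-8")))
```

The string returned by `fetch_page_html` can then be handed to `BeautifulSoup(html, "lxml")`, where a search for tables with the confluenceTable class should have something to match, provided the page really contains such tables.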

Answer 1: score 2, accepted, 2016-07-24 14:16:02
