
Python BeautifulSoup table scraping

My HTML has several tables. The first table is:

<table>
    <tr>
        <td>
            <div id="string">
            </div>
        </td>
    </tr>
</table>

The rest are of the form:

<table class="confluenceTable" data-csvtable="1">
      <tbody>
          <tr>
             <th class="highlight-grey confluenceTh" data-highlight-colour="grey" rowspan="2" style="text-align: center;">Negev</th>

I want to scrape data from the tables. When I use:

from bs4 import BeautifulSoup
from urllib.request import urlopen

url = 'XXX'
soup = BeautifulSoup(urlopen(url).read(), "lxml")
for table in soup.findAll('table'):
    print(table)

it only finds the first table. When I change the search to:

soup.findAll("table", { "class" : "confluenceTable" })

it doesn't find anything. What am I missing?

Using Python 3.4 on Windows with BeautifulSoup 4.5.

I suspect you are trying to scrape an Atlassian Confluence page, which is usually quite dynamic and relies heavily on JavaScript to load its content. urllib only downloads the initial markup and does not execute JavaScript, so if you look at the HTML source it returns, you will not find any table elements with the confluenceTable class.
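One way to confirm this is to count the tables in the raw markup before BeautifulSoup is involved at all. The following is a minimal sketch using only the standard library; TableCounter, count_tables and sample_html are hypothetical names, and sample_html stands in for whatever urlopen(url).read() actually returns for your page.

```python
from html.parser import HTMLParser


class TableCounter(HTMLParser):
    """Counts <table> start tags, and how many carry the confluenceTable class."""

    def __init__(self):
        super().__init__()
        self.total = 0
        self.confluence = 0

    def handle_starttag(self, tag, attrs):
        if tag != "table":
            return
        self.total += 1
        # The class attribute may be missing or hold several space-separated names.
        classes = (dict(attrs).get("class") or "").split()
        if "confluenceTable" in classes:
            self.confluence += 1


def count_tables(html):
    """Return (all tables, tables with class confluenceTable) found in html."""
    counter = TableCounter()
    counter.feed(html)
    return counter.total, counter.confluence


# Assumption: stand-in for the decoded response body, e.g.
# urlopen(url).read().decode("utf-8"). It contains only the static first table.
sample_html = """
<table>
    <tr>
        <td>
            <div id="string">
            </div>
        </td>
    </tr>
</table>
<div id="main-content"></div>
"""

print(count_tables(sample_html))  # (1, 0)
```

If the second number is 0 for the real response as well, the confluenceTable tables are not in the downloaded HTML, and no BeautifulSoup query will find them.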

Instead, you should either look into using the Confluence API or use a browser automation tool like selenium.
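As a rough sketch of the API route: the Confluence REST API can return a page's rendered body as JSON via the content resource with expand=body.view. The helper names below (content_url, rendered_html, fetch_page_html) are hypothetical, and the base URL, page id and authentication headers are assumptions that depend on your instance; Confluence Cloud, for example, normally serves the API under a /wiki prefix.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen


def content_url(base_url, page_id):
    """Build the REST URL that asks for a page together with its rendered body."""
    return "{}/rest/api/content/{}?expand=body.view".format(
        base_url.rstrip("/"), quote(str(page_id))
    )


def rendered_html(payload):
    """Pull the rendered HTML string out of the decoded JSON response."""
    return payload["body"]["view"]["value"]


def fetch_page_html(base_url, page_id, headers=None):
    """Fetch one page and return its rendered HTML.

    headers is where authentication would go (for example an Authorization
    header); what is required depends on how the instance is configured.
    """
    request = Request(content_url(base_url, page_id), headers=headers or {})
    with urlopen(request) as response:
        return rendered_html(json.loads(response.read().decode("utf-8")))
```

The returned string can then be passed to BeautifulSoup exactly as in the question, for example BeautifulSoup(html, "lxml").findAll("table", {"class": "confluenceTable"}). It is worth checking that the rendered view on your instance actually keeps the confluenceTable class before relying on that filter.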
