Python BeautifulSoup table scraping

Question

My HTML has several tables. The first table is:

<table>
    <tr>
        <td>
            <div id="string">
            </div>
        </td>
    </tr>
</table>

The rest are of the form:

<table class="confluenceTable" data-csvtable="1">
      <tbody>
          <tr>
             <th class="highlight-grey confluenceTh" data-highlight-colour="grey" rowspan="2" style="text-align: center;">Negev</th>

I want to scrape data from the tables. When I use:

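from urllib.request import urlopen

from bs4 import BeautifulSoup

url = 'XXX'
# Download the raw HTML and parse it with the lxml parser
soup = BeautifulSoup(urlopen(url).read(), "lxml")
# Print every <table> element found in the downloaded source
for table in soup.find_all('table'):
    print(table)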

it only finds the first table. When I change the search to:

soup.find_all("table", class_="confluenceTable")

it doesn't find anything. What am I missing?

I'm using Python 3.4 on Windows with BeautifulSoup 4.5.

Answer 1

I suspect you are trying to scrape an Atlassian Confluence page. These pages are usually quite dynamic and rely heavily on JavaScript to load their content. urllib does not execute JavaScript, so it only receives the initial HTML. If you inspect the HTML source you download with urllib, you will not find any table elements with the confluenceTable class.
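One way to confirm this is to count the tables in the raw source before any JavaScript has run. The snippet below is only a sketch: `count_tables` is a hypothetical helper, and `static_html` is a made-up stand-in for what `urlopen(url).read()` might return for such a page, not the actual page content.

```python
from bs4 import BeautifulSoup


def count_tables(html, class_name=None):
    """Return how many <table> elements (optionally with a given class) are in html."""
    soup = BeautifulSoup(html, "html.parser")
    if class_name is None:
        return len(soup.find_all("table"))
    return len(soup.find_all("table", class_=class_name))


# Hypothetical stand-in for the downloaded source: only the placeholder
# table is present, the remaining tables would be added later by JavaScript.
static_html = """
<table>
    <tr><td><div id="string"></div></td></tr>
</table>
<script>/* confluenceTable elements are injected here at runtime */</script>
"""

print(count_tables(static_html))                     # tables in the raw source
print(count_tables(static_html, "confluenceTable"))  # tables with the Confluence class
```

If the second count is zero for the real download as well, the tables are not in the HTML that urllib receives, and no BeautifulSoup query will find them.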

Instead, you should either look into using the Confluence API or use a browser automation tool like Selenium.
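A rough sketch of the Selenium route might look like the following. It assumes Firefox and its driver are installed, and that the tables you want carry the confluenceTable class as in your snippet. `fetch_rendered_html` and `extract_tables` are hypothetical helper names, and the 15 second timeout is an arbitrary choice.

```python
from bs4 import BeautifulSoup


def fetch_rendered_html(url, timeout=15):
    """Load url in a real browser and return the HTML after JavaScript has run."""
    # Imported here so the parsing helper below works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()  # assumption: Firefox and its driver are available
    try:
        driver.get(url)
        # Wait until at least one Confluence table is present in the DOM.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "table.confluenceTable"))
        )
        return driver.page_source
    finally:
        driver.quit()


def extract_tables(html):
    """Return each confluenceTable as a list of rows, each row a list of cell texts."""
    soup = BeautifulSoup(html, "html.parser")
    tables = []
    for table in soup.find_all("table", class_="confluenceTable"):
        rows = []
        for tr in table.find_all("tr"):
            cells = tr.find_all(["th", "td"])
            rows.append([cell.get_text(strip=True) for cell in cells])
        tables.append(rows)
    return tables
```

Calling `extract_tables(fetch_rendered_html(url))` with the real page URL should then give you the table contents, provided the page does not also require a login.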

1 answer

solution1
2 ACCEPTED 2016-07-24 14:16:02