How to scrape tables using Beautiful Soup?

I tried scraping a table by following the question: Python BeautifulSoup scrape tables

Based on the top answer there, I tried the following.

HTML code:

<div class="table-frame small">
    <table id="rfq-display-line-items-list" class="table">
        <thead id="rfq-display-line-items-header">
          <tr>
          <th>Mfr. Part/Item #</th>
          <th>Manufacturer</th>
          <th>Product/Service Name</th>
          <th>Qty.</th>
          <th>Unit</th>
          <th>Ship Address</th>
        </tr>
      </thead>
      <tbody id="rfq-display-line-item-0">

        <tr>
            <td><span class="small">43933</span></td>
            <td><span class="small">Anvil International</span></td>
            <td><span class="small">Cap Steel Black 1-1/2"</span></td>
            <td><span class="small">800</span></td>
            <td><span class="small">EA</span></td>
            <td><span class="small">1</span></td>
        </tr>
      <!----><!---->
      </tbody><tbody id="rfq-display-line-item-1">

        <tr>
            <td><span class="small">330035205</span></td>
            <td><span class="small">Anvil International</span></td>
            <td><span class="small">1-1/2" x 8" Black Steel Nipple</span></td>
            <td><span class="small">400</span></td>
            <td><span class="small">EA</span></td>
            <td><span class="small">1</span></td>
        </tr>
      <!----><!---->
      </tbody><!---->
    </table><!---->
</div>

Following that answer, this is what I tried:

for tr in soup.find_all('table', {'id': 'rfq-display-line-items-list'}):
    tds = tr.find_all('td')
    print(tds[0].text, tds[1].text, tds[2].text, tds[3].text, tds[4].text, tds[5].text)

But this printed only the first row:

43933 Anvil International Cap Steel Black 1-1/2" 800 EA 1

I later found out that all of the <td> elements were stored in a single list. I want to print all the rows.

Expected Output:

43933      Anvil International Cap Steel Black 1-1/2" 800 EA 1
330035205  Anvil International 1-1/2" x 8" Black Steel Nipple 400 EA 1         

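Start from the <tr> tags and work down to the <td> cells of each row. Note that the header row contains only <th> cells, so it prints as an empty line: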

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")

for tr in soup.find("table", id="rfq-display-line-items-list").find_all("tr"):
    print(" ".join([td.text for td in tr.find_all('td')]))

43933 Anvil International Cap Steel Black 1-1/2" 800 EA 1
330035205 Anvil International 1-1/2" x 8" Black Steel Nipple 400 EA 1
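
If the column names are needed as well, the same row-by-row walk can pair each cell with its header. The following is a minimal sketch, not part of the original answer: the `html` string is a trimmed-down copy of the question's table (three columns only), and the names `headers` and `records` are chosen here for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed-down copy of the table from the question (three columns only).
html = """
<table id="rfq-display-line-items-list" class="table">
  <thead id="rfq-display-line-items-header">
    <tr><th>Mfr. Part/Item #</th><th>Manufacturer</th><th>Qty.</th></tr>
  </thead>
  <tbody id="rfq-display-line-item-0">
    <tr>
      <td><span class="small">43933</span></td>
      <td><span class="small">Anvil International</span></td>
      <td><span class="small">800</span></td>
    </tr>
  </tbody>
  <tbody id="rfq-display-line-item-1">
    <tr>
      <td><span class="small">330035205</span></td>
      <td><span class="small">Anvil International</span></td>
      <td><span class="small">400</span></td>
    </tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", id="rfq-display-line-items-list")

# Column names come from the <th> cells of the header row.
headers = [th.get_text(strip=True) for th in table.find_all("th")]

records = []
for tr in table.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:  # the header row has no <td> cells, so it is skipped here
        records.append(dict(zip(headers, cells)))

for record in records:
    print(record)
```

Each entry in `records` is a plain dict, so a value can be looked up by column name, for example `records[0]["Qty."]`.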

You can also do that with CSS selectors, as follows:

for tr in soup.select('table#rfq-display-line-items-list tbody tr'):
    tds = tr.find_all('td')
    print(tds[0].text, tds[1].text, tds[2].text, tds[3].text, tds[4].text, tds[5].text)

output:

43933 Anvil International Cap Steel Black 1-1/2" 800 EA 1
330035205 Anvil International 1-1/2" x 8" Black Steel Nipple 400 EA 1
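
The expected output in the question pads the first column so that the second column lines up. `print()` with several arguments separates them with a single space, so the padding has to be added explicitly. Here is one possible sketch using `str.ljust`; the trimmed-down `html` string and the names `rows`, `width` and `lines` are assumptions made for this example.

```python
from bs4 import BeautifulSoup

# Trimmed-down copy of the table from the question (three columns only).
html = """
<table id="rfq-display-line-items-list" class="table">
  <tbody id="rfq-display-line-item-0">
    <tr>
      <td><span class="small">43933</span></td>
      <td><span class="small">Anvil International</span></td>
      <td><span class="small">800</span></td>
    </tr>
  </tbody>
  <tbody id="rfq-display-line-item-1">
    <tr>
      <td><span class="small">330035205</span></td>
      <td><span class="small">Anvil International</span></td>
      <td><span class="small">400</span></td>
    </tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Collect the cell texts of every body row as a list of lists.
rows = []
for tr in soup.select("table#rfq-display-line-items-list tbody tr"):
    rows.append([td.get_text(strip=True) for td in tr.find_all("td")])

# Pad the first column to the width of its longest value plus one space.
width = max(len(row[0]) for row in rows)
lines = [" ".join([row[0].ljust(width + 1)] + row[1:]) for row in rows]

for line in lines:
    print(line)
```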

What happens?

When you select the table with find_all(), you get a ResultSet with only one element (the table itself), so your loop iterates only once. In that single iteration, find_all('td') collects the <td> cells of every row into one flat list, and printing indices 0 to 5 shows only the first row.
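
This can be checked by looking at the lengths involved. The snippet below is a small sketch on a trimmed-down copy of the question's table (two columns only); the variable names are chosen for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed-down copy of the table from the question (two columns only).
html = """
<table id="rfq-display-line-items-list" class="table">
  <thead><tr><th>Mfr. Part/Item #</th><th>Manufacturer</th></tr></thead>
  <tbody id="rfq-display-line-item-0">
    <tr>
      <td><span class="small">43933</span></td>
      <td><span class="small">Anvil International</span></td>
    </tr>
  </tbody>
  <tbody id="rfq-display-line-item-1">
    <tr>
      <td><span class="small">330035205</span></td>
      <td><span class="small">Anvil International</span></td>
    </tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all() on the table id matches a single element: the <table> itself.
tables = soup.find_all("table", {"id": "rfq-display-line-items-list"})
print(len(tables))

# Searching that table for <td> returns the cells of all rows in one flat list.
all_cells = tables[0].find_all("td")
print(len(all_cells))

# Searching for <tr> instead gives one element per row (header row included).
rows = tables[0].find_all("tr")
print(len(rows))
```

With this trimmed table the three counts are 1, 4 and 3: one table, four cells across both body rows, and three rows including the header.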

How to fix?

Select your target more specifically. As an alternative approach, you could also use CSS selectors and stripped_strings to achieve this.

This selects every <tr> inside a <tbody> of the element (the table) with id="rfq-display-line-items-list":

soup.select('#rfq-display-line-items-list tbody tr')

stripped_strings is a generator that yields the whitespace-stripped strings of all the elements in the row (here, the <td> cells), and you can join() them into a single string:

" ".join(list(row.stripped_strings))
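
To see why stripped_strings is convenient here, compare it with the raw .text of a whole row, which keeps the newlines and indentation between the cells. This is a small sketch on a trimmed-down copy of the question's table; `select_one` returns the first matching row, and the name `row` is chosen for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed-down copy of the first body row from the question (two columns only).
html = """
<table id="rfq-display-line-items-list" class="table">
  <tbody id="rfq-display-line-item-0">
    <tr>
      <td><span class="small">43933</span></td>
      <td><span class="small">Anvil International</span></td>
    </tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
row = soup.select_one("#rfq-display-line-items-list tbody tr")

# .text on the whole row keeps the whitespace between the cells.
print(repr(row.text))

# stripped_strings skips whitespace-only strings and strips the rest.
print(list(row.stripped_strings))
```

One caveat: stripped_strings yields one string per text node, so a cell that contains several nested tags with text would contribute more than one string to the joined result.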

Example

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")

for row in soup.select('#rfq-display-line-items-list tbody tr'):
    print(" ".join(list(row.stripped_strings)))

Output

43933 Anvil International Cap Steel Black 1-1/2" 800 EA 1
330035205 Anvil International 1-1/2" x 8" Black Steel Nipple 400 EA 1
