Getting a specific table from a web page with BeautifulSoup

Question

I want to get the data from the third table on http://www.dividend.com/dividend-stocks/. Here is my code; I could use some help.

import requests
from bs4 import BeautifulSoup

url = "http://www.dividend.com/dividend-stocks/"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html5lib")

# Skip first two tables
tables = soup.find("table")
tables = tables.find_next("table")
tables = tables.find_next("table")

row = ''
for td in tables.find_all("td"):
    if len(td.text.strip()) > 0:
        row = row + td.text.strip().replace('\n', ' ') +','
        # Handle last column in a row, remove extra comma and add new line
        if td.get('data-th') == 'Pay Date':
            row = row[:-1] + '\n'
print(row)

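Is there a better way to skip the first two tables? Or is there a simple way to skip over a large chunk of markup in BeautifulSoup? If so, how do I target the part I want?
The order of my code's output also differs from the order shown on the web page. The table on the page looks like this:

But my code's output looks like this: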

AAPL,Apple Inc.,1.76%,$143.39,$2.52,5/11,5/18
GE,General Electric,3.32%,$28.91,$0.96,6/15,7/25
XOM,Exxon Mobil,3.71%,$83.03,$3.08,5/10,6/9
CVX,Chevron Corp,4.01%,$107.72,$4.32,5/17,6/12
BP,BP PLC ADR,6.66%,$35.72,$2.38,5/10,6/23

What am I doing wrong? Thanks for your help!

Answer 1

You can use a CSS selector to find a specific table:

tables = soup.select("table:nth-of-type(3)")

I'm not sure why your results come out in a different order from what the web page shows.
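One caveat worth checking: `:nth-of-type(3)` counts tables among siblings under the same parent, not across the whole document, and `select` returns a list. The sketch below compares it with indexing into `find_all("table")`. The two HTML snippets are made-up stand-ins, not the real page markup, and `third_table_ids` is just an illustrative helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: three tables that share one parent.
SIBLINGS = """
<body>
<table id="a"><tr><td>1</td></tr></table>
<table id="b"><tr><td>2</td></tr></table>
<table id="c"><tr><td>3</td></tr></table>
</body>
"""

# Hypothetical markup: the same tables, each inside its own wrapper.
WRAPPED = """
<div><table id="a"><tr><td>1</td></tr></table></div>
<div><table id="b"><tr><td>2</td></tr></table></div>
<div><table id="c"><tr><td>3</td></tr></table></div>
"""


def third_table_ids(html):
    """Return (ids matched by the selector, id of the third table in document order)."""
    soup = BeautifulSoup(html, "html.parser")
    # select() returns a list; it may be empty or hold more than one table.
    by_selector = [t["id"] for t in soup.select("table:nth-of-type(3)")]
    # find_all() walks the whole document, so index 2 is the third table overall.
    by_index = soup.find_all("table")[2]["id"]
    return by_selector, by_index


print(third_table_ids(SIBLINGS))
print(third_table_ids(WRAPPED))
```

If the tables on the real page sit in separate containers, the selector may match nothing, in which case `soup.find_all("table")[2]` is probably the safer choice.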

Answer 2

Although @Barmar's approach looks cleaner, here is another way that uses soup.find_all and saves the result as JSON (even though that wasn't asked for in the question).

import json

import requests
from bs4 import BeautifulSoup

url = 'http://www.dividend.com/dividend-stocks/'
r = requests.get(url)
r.raise_for_status()
soup = BeautifulSoup(r.content, 'lxml')
stocks = {}

# Skip first two tables and header row of target table
for tr in soup.find_all('table')[2].find_all('tr')[1:]:
    (stock_symbol, company_name, _, dividend_yield, current_price,
     annual_dividend, ex_dividend_date, pay_date) = [
        td.text.strip() for td in tr.find_all('td')]
    stocks[stock_symbol] = {
        'company_name': company_name,
        'dividend_yield': float(dividend_yield.rstrip('%')),
        'current_price': float(current_price.lstrip('$')),
        'annual_dividend': float(annual_dividend.lstrip('$')),
        'ex_dividend_date': ex_dividend_date,
        'pay_date': pay_date
    }

with open('stocks.json', 'w') as f:
    json.dump(stocks, f, indent=2)
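If you want the comma-separated lines from the question rather than JSON, the same row-by-row idea works with the standard `csv` module. This is only a sketch: `SAMPLE_HTML` is a made-up stand-in for the page, and `table_to_csv` is a hypothetical helper name. Going through each `tr` means you don't need to watch for the 'Pay Date' cell to know where a row ends.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical stand-in for the page; the real markup may differ.
SAMPLE_HTML = """
<table><tr><td>first</td></tr></table>
<table><tr><td>second</td></tr></table>
<table>
  <tr><th>Symbol</th><th>Company</th><th>Yield</th></tr>
  <tr><td>AAPL</td><td>Apple Inc.</td><td>1.76%</td></tr>
  <tr><td>GE</td><td>General Electric</td><td>3.32%</td></tr>
</table>
"""


def table_to_csv(html, index=2):
    """Return the data rows of the table at position `index` as CSV text."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find_all("table")[index]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for tr in table.find_all("tr"):
        # Collapse internal whitespace, then drop cells that are empty.
        cells = [" ".join(td.get_text().split()) for td in tr.find_all("td")]
        cells = [c for c in cells if c]
        if cells:  # a header row made only of <th> cells yields nothing here
            writer.writerow(cells)
    return out.getvalue()


print(table_to_csv(SAMPLE_HTML))
```

`csv.writer` also quotes any field that happens to contain a comma, which plain string concatenation would not do.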

Answer 3

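Thanks to @Barmar and @Delirious Lettuce for posting solutions and code. About the order of the output: I realized that every time I refresh the page, I briefly see the data the way my code outputs it, and only then do I see the sorted data. After trying a few different approaches, I was able to extract the data as it is displayed on the web by using Selenium WebDriver. Thank you all.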
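I didn't post the Selenium code, so the following is only a rough sketch of the general idea, not necessarily what I ran. It assumes Chrome with a matching driver is available. `fetch_rendered_html` and `table_rows` are illustrative names, and waiting for the first table to appear is a guess at a suitable condition.

```python
from bs4 import BeautifulSoup


def fetch_rendered_html(url, timeout=10):
    """Load `url` in a real browser and return the HTML after scripts have run."""
    # Imported lazily so table_rows() below works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()  # assumption: Chrome and its driver are installed
    try:
        driver.get(url)
        # Assumption: the presence of a table is a good enough signal. A page
        # that re-sorts later may need a stricter wait condition.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.TAG_NAME, "table")))
        return driver.page_source
    finally:
        driver.quit()


def table_rows(html, index=2):
    """Return the non-empty <td> rows of the table at position `index`."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find_all("table")[index]
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:
            rows.append(cells)
    return rows


# Usage (needs a browser, so it is left commented out):
# html = fetch_rendered_html("http://www.dividend.com/dividend-stocks/")
# for row in table_rows(html):
#     print(",".join(row))
```

The parsing step is the same as with requests; the only change is that the HTML comes from the browser after the page's scripts have run.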

BPT,BP Prudhoe Bay Royalty Trust,21.12%,$20.80,$4.39,4/11,4/20
PER,Sandridge Permian Trust,18.06%,$2.88,$0.52,5/10,5/26
CHKR,Chesapeake Granite Wash Trust,16.75%,$2.40,$0.40,5/18,6/1
NAT,Nordic American Tankers,13.33%,$6.00,$0.80,5/18,6/8
WIN,Windstream Corp,13.22%,$4.54,$0.60,6/28,7/17
NYMT,New York Mortgage Trust Inc,12.14%,$6.59,$0.80,6/22,7/25
IEP,Icahn Enterprises L.P.,11.65%,$51.50,$6.00,5/11,6/14
FTR,Frontier Communications,11.51%,$1.39,$0.16,6/13,6/30


Answer 1
2 2017-06-16 22:54:17

Answer 2
1 2017-06-16 23:40:18

Answer 3
1 2017-06-19 16:10:01
