How do I scrape specific data from Yahoo Finance?

Question

I'm new to web scraping, and I'm trying to scrape the Yahoo Finance "Statistics" page for AAPL. Here is the link: https://finance.yahoo.com/quote/AAPL/key-statistics?p=AAPL

Here is my code so far...

from bs4 import BeautifulSoup
from requests import get


url = 'https://finance.yahoo.com/quote/AAPL/key-statistics?p=AAPL'
response = get(url)
soup = BeautifulSoup(response.text, 'html.parser')

stock_data = soup.find_all("table")

for stock in stock_data:
    print(stock.text)

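When I run it, I get back all of the table data on the page. However, I only want specific data from each table (e.g. "Market Cap", "Revenue", "Beta").

I tried changing the code to print(stock[1].text) to see whether I could limit the returned data to the second value in each table, but that returned an error message. Am I on the right track with BeautifulSoup, or do I need a completely different library? What do I have to do to return only specific data rather than all of the table data on the page?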

Answer 1

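Inspecting the HTML will give you the best idea of how BeautifulSoup will handle what it sees.

The web page appears to contain several tables, which in turn contain the information you are looking for. The tables follow a consistent logic.

First scrape all the tables on the page, then find all the table rows (<tr>) and table data cells (<td>) that those tables contain.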
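A minimal offline sketch of that traversal, run on a small hand-written HTML snippet rather than the live page (the snippet, its labels and its values are invented for illustration). It also suggests why `stock[1]` fails in the question: each item returned by `find_all("table")` is a single Tag, and indexing a Tag looks up an HTML attribute, so an integer index raises `KeyError` rather than returning the second cell.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page; labels and values are invented.
SAMPLE_HTML = """
<table>
  <tr><th>Valuation Measures</th><th></th></tr>
  <tr><td>Market Cap (intraday)</td><td>1.00T</td></tr>
  <tr><td>Beta</td><td>1.20</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find_all("table")[0]

# Indexing a Tag reads an attribute, so table[1] is not "the second value".
try:
    table[1]
    index_error = None
except KeyError as exc:
    index_error = exc

# Walk table -> rows -> cells; header rows hold <th> cells and no <td>.
rows = []
for tr in table.find_all("tr"):
    tds = tr.find_all("td")
    if len(tds) >= 2:
        rows.append((tds[0].get_text(strip=True), tds[1].get_text(strip=True)))

print(rows)
```

The `len(tds) >= 2` check matters because a header row yields an empty `tds` list, and indexing it would raise `IndexError`.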

Below is one way of achieving this. I even threw in a function that prints only one specific measurement.

from bs4 import BeautifulSoup
from requests import get


url = 'https://finance.yahoo.com/quote/AAPL/key-statistics?p=AAPL'
response = get(url)
soup = BeautifulSoup(response.text, 'html.parser')

stock_data = soup.find_all("table")
# stock_data will contain multiple tables, next we examine each table one by one

for table in stock_data:
    # Scrape all table rows into variable trs
    trs = table.find_all('tr')
    for tr in trs:
        # Scrape all table data tags into variable tds
        tds = tr.find_all('td')
        # Skip header rows and any row without both a measure and a value
        if len(tds) < 2:
            continue
        # Index 0 of tds will contain the measurement
        print("Measure: {}".format(tds[0].get_text()))
        # Index 1 of tds will contain the value
        print("Value: {}".format(tds[1].get_text()))
        print("")


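def get_measurement(table_array, measurement):
    # Return the value of the first row whose label contains `measurement`
    for table in table_array:
        trs = table.find_all('tr')
        for tr in trs:
            tds = tr.find_all('td')
            # Skip header rows and any row without both a measure and a value
            if len(tds) < 2:
                continue
            if measurement.lower() in tds[0].get_text().lower():
                return tds[1].get_text()
    # No matching row was found
    return None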


# print only one measurement, e.g. operating cash flow
print(get_measurement(stock_data, "operating cash flow"))
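If several specific values are wanted (the question mentions market cap, revenue and beta), one option is to collect every label/value pair into a dictionary once and then look values up by label. This is only a sketch under assumptions: `tables_to_dict` and `pick` are hypothetical helper names, and the sample HTML and its labels are invented, so the labels on the real page may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page; labels and values are invented.
SAMPLE_HTML = """
<table>
  <tr><td>Market Cap (intraday)</td><td>1.00T</td></tr>
  <tr><td>Beta (5Y Monthly)</td><td>1.20</td></tr>
</table>
<table>
  <tr><th>Income Statement</th></tr>
  <tr><td>Revenue (ttm)</td><td>260B</td></tr>
</table>
"""


def tables_to_dict(tables):
    """Map each row's first cell (label) to its second cell (value)."""
    data = {}
    for table in tables:
        for tr in table.find_all("tr"):
            tds = tr.find_all("td")
            if len(tds) >= 2:  # skip header or malformed rows
                data[tds[0].get_text(strip=True)] = tds[1].get_text(strip=True)
    return data


def pick(data, wanted):
    """For each wanted term, return the first value whose label contains it."""
    result = {}
    for term in wanted:
        result[term] = next(
            (v for k, v in data.items() if term.lower() in k.lower()), None
        )
    return result


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
stats = tables_to_dict(soup.find_all("table"))
selected = pick(stats, ["Market Cap", "Revenue", "Beta", "EBITDA"])
print(selected)
```

A term with no matching label (here "EBITDA") maps to `None`, which makes missing values easy to spot.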

Answer 2

This isn't Yahoo Finance, but you can do something very similar...

import pandas
import requests
from bs4 import BeautifulSoup

base_url = 'https://finviz.com/screener.ashx?v=152&o=price&t=MSFT,AAPL,SBUX,S,GOOG&o=price&c=0,1,2,3,4,5,6,7,8,9,25,63,64,65,66,67'
html = requests.get(base_url)
soup = BeautifulSoup(html.content, "html.parser")
main_div = soup.find('div', attrs={'id': 'screener-content'})

# Result rows alternate between a light and a dark CSS class
light_rows = main_div.find_all('tr', class_="table-light-row-cp")
dark_rows = main_div.find_all('tr', class_="table-dark-row-cp")

data = []
for rows_set in (light_rows, dark_rows):
    for row in rows_set:
        row_data = []
        for cell in row.find_all('td'):
            # Prefer the link text; fall back to the cell text if there is no link
            link = cell.a
            val = link.get_text() if link is not None else cell.get_text()
            row_data.append(val)
        data.append(row_data)

# Sort rows by the number in the first column to restore the original order
data.sort(key=lambda x: int(x[0]))

pandas.DataFrame(data).to_csv("C:\\your_path\\AAA.csv", header=False)
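The sort at the end is needed because the light and dark rows are gathered in two separate passes. A possible alternative, sketched here on an invented HTML snippet, is to match both row classes with one CSS selector via `select()`, which returns matches in document order. The class names and the `screener-content` id are taken from the code above; the sample rows are made up.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the screener page; the rows are invented.
SAMPLE_HTML = """
<div id="screener-content"><table>
  <tr class="table-dark-row-cp"><td><a>1</a></td><td><a>AAPL</a></td></tr>
  <tr class="table-light-row-cp"><td><a>2</a></td><td><a>GOOG</a></td></tr>
  <tr class="table-dark-row-cp"><td><a>3</a></td><td><a>MSFT</a></td></tr>
</table></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# One selector group covers both row classes, so rows keep their page order.
rows = soup.select(
    "#screener-content tr.table-light-row-cp, "
    "#screener-content tr.table-dark-row-cp"
)
data = [[td.get_text(strip=True) for td in row.find_all("td")] for row in rows]
print(data)
```

With the rows already in page order, the `data.sort(...)` step would no longer be required.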

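This is a good alternative in case Yahoo decides to deprecate more of its API's functionality. I know they cut a lot of things a few years ago (mainly historical quotes). It was sad to see that go.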

2 answers

Answer 1: score 2, accepted, 2020-02-25 10:23:58

Answer 2: score 0, 2020-03-12 00:23:34
