Getting 404 error for certain stocks and pages on Yahoo Finance (Python)
I am trying to scrape data from Yahoo Finance via this URL: https://finance.yahoo.com/quote/AAPL/key-statistics?p=AAPL. After running the Python code below, I get the following HTML response:
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import requests, lxml
from lxml import html
stockStatDict = {}
stockSymbol = 'AAPL'
URL = 'https://finance.yahoo.com/quote/'+ stockSymbol + '/key-statistics?p=' + stockSymbol
page = requests.get(URL)
print(page.text)
<!DOCTYPE html>
<html lang="en-us"><head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<meta charset="utf-8">
<title>Yahoo</title>
<meta name="viewport" content="width=device-width,initial-scale=1,minimal-ui">
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
<style>
html {
height: 100%;
}
body {
background: #fafafc url(https://s.yimg.com/nn/img/sad-panda-201402200631.png) 50% 50%;
background-size: cover;
height: 100%;
text-align: center;
font: 300 18px "helvetica neue", helvetica, verdana, tahoma, arial, sans-serif;
}
table {
height: 100%;
width: 100%;
table-layout: fixed;
border-collapse: collapse;
border-spacing: 0;
border: none;
}
h1 {
font-size: 42px;
font-weight: 400;
color: #400090;
}
p {
color: #1A1A1A;
}
#message-1 {
font-weight: bold;
margin: 0;
}
#message-2 {
display: inline-block;
*display: inline;
zoom: 1;
max-width: 17em;
_width: 17em;
}
</style>
<script>
document.write('<img src="//geo.yahoo.com/b?s=1197757129&t='+new Date().getTime()+'&src=aws&err_url='+encodeURIComponent(document.URL)+'&err=%<pssc>&test='+encodeURIComponent('%<{Bucket}cqh[:200]>')+'" width="0px" height="0px"/>');var beacon = new Image();beacon.src="//bcn.fp.yahoo.com/p?s=1197757129&t="+new Date().getTime()+"&src=aws&err_url="+encodeURIComponent(document.URL)+"&err=%<pssc>&test="+encodeURIComponent('%<{Bucket}cqh[:200]>');
</script>
</head>
<body>
<!-- status code : 404 -->
<!-- Not Found on Server -->
<table>
<tbody><tr>
<td>
<img src="https://s.yimg.com/rz/p/yahoo_frontpage_en-US_s_f_p_205x58_frontpage.png" alt="Yahoo Logo">
<h1 style="margin-top:20px;">Will be right back...</h1>
<p id="message-1">Thank you for your patience.</p>
<p id="message-2">Our engineers are working quickly to resolve the issue.</p>
</td>
</tr>
</tbody></table>
</body></html>
I'm confused, because I have no problem scraping the data on the Summary tab at this URL https://finance.yahoo.com/quote/AAPL?p=AAPL using the following code:
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import requests, lxml
from lxml import html
stockDict = {}
stockSymbol = 'AAPL'
URL = 'https://finance.yahoo.com/quote/'+ stockSymbol + '?p=' + stockSymbol
page = requests.get(URL)
print(page.text)
soup = BeautifulSoup(page.content, 'html.parser')
stock_data = soup.find_all('table')
stock_data
for table in stock_data:
    trs = table.find_all('tr')
    for tr in trs:
        tds = tr.find_all('td')
        if len(tds) > 0:
            stockDict[tds[0].get_text()] = [tds[1].get_text()]
stock_sum_df = pd.DataFrame(data=stockDict)
print(stock_sum_df.head())
print(stock_sum_df.info())
Does anyone know what I'm doing wrong? I'm also using the free version of Yahoo Finance, if that makes any difference.
So I figured out your problem.
The User-Agent request header contains a characteristic string that lets network protocol peers identify the application type, operating system, software vendor, or software version of the requesting software user agent. Validating the User-Agent header on the server side is a common operation, so be sure to use a valid browser's User-Agent string to avoid getting blocked.
Source: http://go-colly.org/articles/scraping_related_http_headers/
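You can see exactly what gives your script away: requests announces itself in its default headers, which is the string the server keys on. A quick check (the exact version number will vary with your installation):

```python
import requests

# requests ships its own User-Agent, e.g. "python-requests/2.31.0",
# which is trivial for a server to detect and block.
default_ua = requests.utils.default_headers()['User-Agent']
print(default_ua)
```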
The only thing you need to do is set a legitimate user agent. So add headers to emulate a browser:
# This is a standard user-agent of Chrome browser running on Windows 10
headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36' }
Example:
import requests
headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
stockSymbol = 'AAPL'
url = 'https://finance.yahoo.com/quote/'+ stockSymbol + '/key-statistics?p=' + stockSymbol
resp = requests.get(url, headers=headers, timeout=5).text
print(resp)
Additionally, you can send a fuller set of headers to look even more like a legitimate browser. Add more headers like this:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'DNT': '1',  # Do Not Track request header
    'Connection': 'close'
}
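Your table-parsing loop itself is fine, by the way; you can verify it offline against a minimal HTML snippet (made-up values, not real Yahoo markup) without hitting the network at all:

```python
from bs4 import BeautifulSoup

# Minimal stand-in for a Yahoo Finance summary table (hypothetical values)
html = """
<table>
  <tr><td>Previous Close</td><td>150.00</td></tr>
  <tr><td>Open</td><td>151.25</td></tr>
</table>
"""

stockDict = {}
soup = BeautifulSoup(html, 'html.parser')
for table in soup.find_all('table'):
    for tr in table.find_all('tr'):
        tds = tr.find_all('td')
        if len(tds) > 0:
            # first cell is the label, second cell is the value
            stockDict[tds[0].get_text()] = [tds[1].get_text()]

print(stockDict)  # {'Previous Close': ['150.00'], 'Open': ['151.25']}
```

This separates the two failure modes: if the offline parse works but the live page returns the "Will be right back" HTML, the problem is the request being blocked, not your parsing.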
Blocks like this usually come down to the server checking who is asking, so when designing an automated system it is always a good idea to provide a user agent in the request headers.
不確定是什么導致了問題以及您的項目的意圖是什么。 但是,如果您的意圖是能夠使用雅虎財務數據做一些事情 - 而不是學習如何抓取數據,那么以下模塊可以幫助您( https://pypi.org/project/yahoo-finance/ )