Problem HTTP error 403 in Python 3 Web Scraping

Question

I'm trying to scrape a website for practice, but I keep getting HTTP Error 403 (does it think I'm a bot?).

Here is my code:

#import requests
import urllib.request
from bs4 import BeautifulSoup
#from urllib import urlopen
import re

webpage = urllib.request.urlopen('http://www.cmegroup.com/trading/products/#sortField=oi&sortAsc=false&venues=3&page=1&cleared=1&group=1').read
findrows = re.compile('<tr class="- banding(?:On|Off)>(.*?)</tr>')
findlink = re.compile('<a href =">(.*)</a>')

row_array = re.findall(findrows, webpage)
links = re.finall(findlink, webpate)

print(len(row_array))

iterator = []

The error I get is:

 File "C:\Python33\lib\urllib\request.py", line 160, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Python33\lib\urllib\request.py", line 479, in open
    response = meth(req, response)
  File "C:\Python33\lib\urllib\request.py", line 591, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Python33\lib\urllib\request.py", line 517, in error
    return self._call_chain(*args)
  File "C:\Python33\lib\urllib\request.py", line 451, in _call_chain
    result = func(*args)
  File "C:\Python33\lib\urllib\request.py", line 599, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden

Answer 1

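This is probably because of mod_security or a similar server security feature that blocks known spider/bot user agents (urllib sends something like Python-urllib/3.3, which is easy to detect). Try setting a known browser user agent: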

from urllib.request import Request, urlopen

req = Request('http://www.cmegroup.com/trading/products/#sortField=oi&sortAsc=false&venues=3&page=1&cleared=1&group=1', headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()

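This works for me.

By the way, in your code you are missing the () after .read in the urlopen line, but I assume that's a typo.

TIP: since this is an exercise, choose a different, non-restrictive site. Maybe they are blocking urllib for some reason...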
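
To see why the server can tell the script apart from a browser, it may help to look at the header urllib sends by default. The following is a minimal sketch using only the standard library; the helper names default_user_agent and browser_like_request are made up for illustration, and no request is actually sent.

```python
import urllib.request


def default_user_agent() -> str:
    """Return the User-Agent value a plain urllib opener would send."""
    opener = urllib.request.build_opener()
    # addheaders is a list of (name, value) pairs; urllib stores the
    # default header under the name "User-agent".
    return dict(opener.addheaders)["User-agent"]


def browser_like_request(url: str, user_agent: str = "Mozilla/5.0") -> urllib.request.Request:
    """Build (but do not send) a request that overrides the default User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


# Prints a value starting with "Python-urllib/", followed by the Python version.
print(default_user_agent())

req = browser_like_request("http://example.com/")
# Request normalizes header names, so the lookup key is "User-agent".
print(req.get_header("User-agent"))
```

Passing the Request object to urlopen, as in the answer above, sends the overridden header instead of the default one.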

Answer 2

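This is definitely being blocked because of the user agent that urllib sends. The same thing happened to me with OfferUp. You can create a new class called AppURLopener that overrides the user agent with Mozilla.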

import urllib.request

# Note: FancyURLopener has been deprecated since Python 3.3.
class AppURLopener(urllib.request.FancyURLopener):
    version = "Mozilla/5.0"  # sent as the User-Agent header

opener = AppURLopener()
response = opener.open('http://httpbin.org/user-agent')


Answer 3

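"This is probably because of mod_security or a similar server security feature that blocks known spider/bot user agents (urllib sends something like Python-urllib/3.3, which is easy to detect)", as Stefano Sanfilippo already mentioned.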

from urllib.request import Request, urlopen
url="https://stackoverflow.com/search?q=html+error+403"
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})

web_byte = urlopen(req).read()

webpage = web_byte.decode('utf-8')

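web_byte is the bytes object returned by the server, and the content of most web pages is encoded as UTF-8. You therefore need to decode web_byte with the decode method.

This solved my whole problem while I was trying to scrape a website from PyCharm.

P.S. I use Python 3.4.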
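
Not every page is UTF-8, so one option is to take the charset from the Content-Type header and fall back to UTF-8 only when none is given. This is a rough sketch under that assumption; decode_body is a hypothetical helper, not part of urllib.

```python
from email.message import Message
from typing import Optional


def decode_body(raw: bytes, content_type: Optional[str] = None) -> str:
    """Decode response bytes using the charset named in a Content-Type value.

    Falls back to UTF-8 when no charset is given or the label is unknown.
    """
    charset = None
    if content_type:
        # Message knows how to parse "text/html; charset=..." style values.
        header = Message()
        header["Content-Type"] = content_type
        charset = header.get_content_charset()
    try:
        return raw.decode(charset or "utf-8", errors="replace")
    except LookupError:
        # The server named a charset Python does not recognize.
        return raw.decode("utf-8", errors="replace")


latin_bytes = "h\u00e9llo".encode("latin-1")
print(decode_body(latin_bytes, "text/html; charset=ISO-8859-1"))
```

With a live response, response.headers.get_content_charset() should give the same charset directly, since the headers object is also an email message.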

Answer 4

Building on the previous answers, this worked for me with Python 3.7 after increasing the timeout to 10 seconds.

from urllib.request import Request, urlopen

req = Request('Url_Link', headers={'User-Agent': 'XYZ/3.0'})
webpage = urlopen(req, timeout=10).read()

print(webpage)
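
When experimenting with headers and timeouts, it can be useful to see which kind of failure occurred instead of getting a traceback. The sketch below wraps the same call in error handling; the function name fetch and its defaults are assumptions for illustration. The last lines use a data: URL so the example runs without network access.

```python
import socket
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen


def fetch(url, user_agent="XYZ/3.0", timeout=10):
    """Return the response body as bytes, or None if the request fails."""
    req = Request(url, headers={"User-Agent": user_agent})
    try:
        with urlopen(req, timeout=timeout) as response:
            return response.read()
    except HTTPError as err:
        # The server answered, but with an error status such as 403.
        print(f"Server answered with HTTP {err.code}: {err.reason}")
    except URLError as err:
        # The request never got a valid answer (DNS failure, refused connection, ...).
        print(f"Could not complete the request: {err.reason}")
    except socket.timeout:
        # The connection was made but reading the body took too long.
        print(f"No response within {timeout} seconds")
    return None


# Offline check: urllib can open data: URLs without touching the network.
print(fetch("data:text/plain,hello"))
```

HTTPError is caught before URLError because it is a subclass of it; reversing the order would hide the status code.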

Answer 5

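Since the page works in a browser but not when called from a Python program, it seems that the web app serving that URL recognizes that you are not requesting the content through a browser.

Demonstration:

curl --dump-header r.txt 'http://www.cmegroup.com/trading/products/#sortField=oi&sortAsc=false&venues=3&page=1&cleared=1&group=1'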

...
<HTML><HEAD>
<TITLE>Access Denied</TITLE>
</HEAD><BODY>
<H1>Access Denied</H1>
You don't have permission to access ...
</HTML>

and the content of r.txt has the status line:

HTTP/1.1 403 Forbidden

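Try sending a "User-Agent" header that imitates a web client.

NOTE: The page contains an Ajax call that creates the table you probably want to parse. You'll need to check the page's JavaScript logic, or simply use a browser debugger (such as the Firebug / Net tab) to see which URL has to be called to get the table's content.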
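
Once the browser's network panel shows which URL the page calls for its table data, that URL can be requested directly. The sketch below assumes the endpoint returns JSON, which is common but not confirmed for this site; fetch_json is a hypothetical helper, and the data: URL stands in for the real endpoint so the example runs offline.

```python
import json
from urllib.request import Request, urlopen


def fetch_json(url, user_agent="Mozilla/5.0", timeout=10):
    """Request a URL with a browser-like User-Agent and parse the body as JSON."""
    req = Request(
        url,
        headers={"User-Agent": user_agent, "Accept": "application/json"},
    )
    with urlopen(req, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Stand-in for the real Ajax endpoint: a data: URL carrying
# the percent-encoded JSON text [{"name":"demo"}].
rows = fetch_json("data:application/json,%5B%7B%22name%22%3A%22demo%22%7D%5D")
print(rows)
```

If the endpoint returns HTML fragments instead of JSON, the body would go to BeautifulSoup rather than json.loads.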

Answer 6

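If you feel guilty about faking the user agent as Mozilla (see the comments on Stefano's top answer), it can also work with any non-urllib User-Agent. This worked for the sites I reference:

    import urllib.request as urlrequest
    req = urlrequest.Request(link, headers={'User-Agent': 'XYZ/3.0'})  # link is the URL to check
    urlrequest.urlopen(req, timeout=10).read()

My application tests validity by scraping the specific links that I refer to in my articles. It is not a generic scraper.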

Answer 7

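You can try two approaches. The details are in this link.

1) Via pip

pip install --upgrade certifi

2) If that doesn't work, try running the Install Certificates.command script that comes bundled with Python 3.* for Mac (go to your Python installation location and double-click the file):

open /Applications/Python\ 3.*/Install\ Certificates.command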
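
A missing CA bundle usually shows up as a certificate verification error rather than a 403, so it is worth checking which certificates Python is actually using. The following sketch builds an SSL context from certifi when that package is installed and from the system defaults otherwise; make_ssl_context is a hypothetical helper name.

```python
import ssl


def make_ssl_context() -> ssl.SSLContext:
    """Build a verifying SSL context, preferring certifi's CA bundle if available."""
    try:
        import certifi  # third-party package, installed via pip
    except ImportError:
        # Fall back to whatever CA certificates the system provides.
        return ssl.create_default_context()
    return ssl.create_default_context(cafile=certifi.where())


context = make_ssl_context()

# Shows where this Python installation looks for CA certificates by default.
print(ssl.get_default_verify_paths())
```

The resulting context can be passed to urlopen through its context parameter, for example urlopen(req, context=context).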

Answer 8

Adding a cookie to the request headers worked for me:

from urllib.request import Request, urlopen

# Function to get the page content
def get_page_content(url, head):
  """
  Function to get the page content
  """
  req = Request(url, headers=head)
  return urlopen(req)

url = 'https://example.com'
head = {
  'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.84 Safari/537.36',
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
  'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
  'Accept-Encoding': 'none',
  'Accept-Language': 'en-US,en;q=0.8',
  'Connection': 'keep-alive',
  'Referer': 'https://example.com',
  'cookie': """your cookie value ( you can get that from your web page) """
}

data = get_page_content(url, head).read()
print(data)
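
Instead of pasting a cookie value by hand, urllib can also collect and resend cookies itself. This is a minimal sketch of that alternative using the standard library's cookie jar; make_cookie_opener is a name chosen for illustration, and whether a given site accepts cookies obtained this way is an open assumption.

```python
from http.cookiejar import CookieJar
from urllib.request import HTTPCookieProcessor, build_opener


def make_cookie_opener(user_agent="Mozilla/5.0"):
    """Return an opener that stores received cookies and sends them back."""
    jar = CookieJar()
    opener = build_opener(HTTPCookieProcessor(jar))
    # Replace urllib's default User-Agent for every request made by this opener.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, jar


opener, jar = make_cookie_opener()

# A call such as opener.open("https://example.com") would store any
# Set-Cookie values in the jar and send them on later requests.
print(len(jar))  # no requests made yet, so the jar is empty
```

A typical pattern is to open the site's landing page first so the jar picks up its cookies, then request the page that returned 403 with the same opener.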

Answer 9

I had the same problem and could not solve it with the answers above. I ended up solving it by using requests.get() and then reading the result's .text instead of calling read():

from requests import get

req = get(link)
result = req.text

Answer 10

You can use urllib's build_opener like this:

import urllib.request

opener = urllib.request.build_opener()
opener.addheaders = [
    ('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36'),
    ('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8'),
    ('Accept-Encoding', 'gzip, deflate, br'), ('Accept-Language', 'en-US,en;q=0.5'), ('Connection', 'keep-alive'), ('Upgrade-Insecure-Requests', '1'),
]
urllib.request.install_opener(opener)
urllib.request.urlretrieve(url, "test.xlsx")  # url is the address of the file to download
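
One caveat with advertising gzip, deflate, br in Accept-Encoding: urllib hands back the body exactly as the server sent it and does not decompress it. The sketch below shows one way to handle the gzip case by checking the Content-Encoding response header; decompress_body is a hypothetical helper, and Brotli (br) is left out because the standard library has no decoder for it.

```python
import gzip


def decompress_body(raw: bytes, content_encoding: str = "") -> bytes:
    """Undo the Content-Encoding of a response body (gzip or none)."""
    encoding = (content_encoding or "").strip().lower()
    if encoding in ("", "identity"):
        # The server sent the body uncompressed.
        return raw
    if encoding == "gzip":
        return gzip.decompress(raw)
    # Other encodings (for example "br") need extra handling or a third-party library.
    raise ValueError(f"unsupported Content-Encoding: {encoding}")


compressed = gzip.compress(b"hello")
print(decompress_body(compressed, "gzip"))
```

With a live response the second argument would come from response.headers.get("Content-Encoding"). A simpler option is to leave Accept-Encoding out so the server sends uncompressed data.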

Answer 11

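I was pulling my hair out over this for a while, and the answer turned out to be simple. I checked the response text and found I was getting "URL signature expired", a message you would not normally see unless you inspect the response text.

This means that some URLs expire, usually for security reasons. Try fetching the URL again and updating it in your script. If there is no new URL for the content you want to scrape, then unfortunately you cannot scrape it.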
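
With urllib, the response text of a failed request is available on the exception itself, because HTTPError can be read like a response. The following sketch shows that idea; error_body is a hypothetical helper, and the error object is built by hand here so the example runs without a network connection.

```python
import io
from email.message import Message
from urllib.error import HTTPError


def error_body(err: HTTPError, limit: int = 500) -> str:
    """Return up to `limit` characters of the body attached to an HTTPError."""
    return err.read().decode("utf-8", errors="replace")[:limit]


# Hand-built stand-in for what urlopen would raise on a 403 response.
fake_error = HTTPError(
    "http://example.com/file",
    403,
    "Forbidden",
    Message(),
    io.BytesIO(b"URL signature expired"),
)
print(fake_error.code, error_body(fake_error))
```

In a real script the same helper would be called inside an except HTTPError as err: block around urlopen.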

Answer 12

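Open the browser's developer tools and go to the Network tab. Select the request for the item you want to scrape; its expanded details include the User-Agent, which you can then add to your own request.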

12 solutions

Solution 1: 309, accepted, 2013-05-18 17:52:11
Solution 2: 52, 2015-08-01 06:00:29
Solution 3: 27, 2017-12-25 07:57:59
Solution 4: 6, 2020-04-16 18:48:32
Solution 5: 2, 2013-05-18 17:55:26
Solution 6: 2, 2020-03-14 03:22:10
Solution 7: 1, 2018-11-16 03:55:40
Solution 8: 1, 2022-04-02 10:04:31
Solution 9: 1, 2022-06-13 14:54:46
Solution 10: 0, 2022-08-09 14:43:49
Solution 11: 0, 2022-08-23 17:07:14
Solution 12: 0, 2023-01-18 10:02:30
