Python error "HTTP Error 403: Forbidden" when web scraping

Question

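I'm a beginner at this and am trying to scrape pages from the Congressional Record. I have a .txt file (url_list.txt) listing the pages I want to download. The contents of the .txt file look like this: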

https://www.congress.gov/congressional-record/2003/3/12/house-section/article/h1752-1
https://www.congress.gov/congressional-record/2003/11/7/house-section/article/h10982-2
https://www.congress.gov/congressional-record/2003/1/29/house-section/article/h231-3

I'm using this code:

import urllib.request

with open('/Users/myusername/Desktop/py_test/url_list.txt') as f:
   for line in f:
      url = line
      path = '/Users/myusername/Desktop/py_test'+url.split('/', -1)[-1]
      urllib.request.urlretrieve(url, path.rstrip('\n'))

I get this error:

Traceback (most recent call last):
  File "/Users/myusername/Desktop/py_test/py_try.py", line 7, in <module>
    urllib.request.urlretrieve(url, path.rstrip('\n'))
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py", line 241, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py", line 216, in urlopen
    return opener.open(url, data, timeout)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py", line 525, in open
    response = meth(req, response)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py", line 634, in http_response
    response = self.parent.error(
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py", line 563, in error
    return self._call_chain(*args)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py", line 496, in _call_chain
    result = func(*args)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py", line 643, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden

Any help would be greatly appreciated.

Answer 1

HTTP error 403 means the server understood your request but refuses to let you access the resource you asked for.

Check that the URL you are requesting is correct (try printing it to make sure). If it is, you probably need to change the User-Agent header of your request.
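Both checks can be sketched with the standard library alone. The helper names `clean_urls` and `build_request` are made up for illustration, and `Mozilla/5.0` is only a generic browser-like value, not something the site is known to require. Printing with `repr()` makes invisible characters visible, such as the trailing newline that every line read from a file carries.

```python
import urllib.request

def clean_urls(lines):
    """Strip surrounding whitespace (including the trailing newline) and drop blank lines."""
    return [line.strip() for line in lines if line.strip()]

def build_request(url, user_agent='Mozilla/5.0'):
    """Return a Request object that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={'User-Agent': user_agent})

# Lines as they come out of a text file: each one ends with a newline.
raw_lines = [
    'https://www.congress.gov/congressional-record/2003/3/12/house-section/article/h1752-1\n',
    '\n',
]

for url in clean_urls(raw_lines):
    print(repr(url))                     # repr() would expose a stray newline if one were left
    req = build_request(url)
    print(req.get_header('User-agent'))  # confirm the header that will be sent
    # urllib.request.urlopen(req) would send the request; it is left out of this sketch
```

Passing the resulting `Request` object to `urllib.request.urlopen` sends the custom header with the request, so this approach does not require switching libraries.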

For this I recommend requests instead of urllib, because requests is easier to use. With requests, your code could look something like this:

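import requests

# Read the URLs, skipping blank lines (e.g. the empty entry after a trailing newline)
with open('/Users/myusername/Desktop/py_test/url_list.txt') as f:
    url_list = [line.strip() for line in f if line.strip()]

for url in url_list:
    # Send a browser-like User-Agent; some servers reject the library default
    r = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    r.raise_for_status()  # raise an HTTPError on 403 or any other 4xx/5xx response
    # Name the output file after the last URL segment, e.g. h1752-1
    with open('/Users/myusername/Desktop/py_test/' + url.split('/')[-1], 'w', encoding='utf-8') as out:
        out.write(r.text)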

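If this doesn't work, the site has probably blocked you, and there is not much you can do about it, because the block is enforced on the server side rather than the client side.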
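One way to tell a block apart from other failures is to catch the error for each URL and look at the status code, instead of letting the first failure stop the script. A rough sketch with the standard library follows. The `fetch` helper and its `opener` parameter are hypothetical names introduced here so the function can be tried without network access.

```python
import urllib.error
import urllib.request

def fetch(url, opener=urllib.request.urlopen):
    """Return (status_code, body_bytes); body is None when the server answers with an error."""
    req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    try:
        with opener(req) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as err:
        # A 403 ends up here; err.code holds the HTTP status the server sent
        return err.code, None
```

If `fetch` still reports 403 even with a browser-like User-Agent, the refusal is most likely a server-side decision of the kind described above.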

Solution 1
0 2022-06-22 18:12:57