Python error when webscraping HTTP Error 403: Forbidden

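I am a beginner at this and am trying to web-scrape the Congressional Record. I have a .txt file (url_list.txt) containing the pages I want to download. The data in the .txt file looks like this: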

https://www.congress.gov/congressional-record/2003/3/12/house-section/article/h1752-1
https://www.congress.gov/congressional-record/2003/11/7/house-section/article/h10982-2
https://www.congress.gov/congressional-record/2003/1/29/house-section/article/h231-3

I am using this code:

import urllib.request

with open('/Users/myusername/Desktop/py_test/url_list.txt') as f:
   for line in f:
      url = line
      path = '/Users/myusername/Desktop/py_test'+url.split('/', -1)[-1]
      urllib.request.urlretrieve(url, path.rstrip('\n'))

I get this error:

Traceback (most recent call last):
  File "/Users/myusername/Desktop/py_test/py_try.py", line 7, in <module>
    urllib.request.urlretrieve(url, path.rstrip('\n'))
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py", line 241, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py", line 216, in urlopen
    return opener.open(url, data, timeout)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py", line 525, in open
    response = meth(req, response)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py", line 634, in http_response
    response = self.parent.error(
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py", line 563, in error
    return self._call_chain(*args)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py", line 496, in _call_chain
    result = func(*args)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py", line 643, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden

Any help would be greatly appreciated.

HTTP error 403 means the server has refused you access to the resource you requested.

Check that the URL you are requesting is correct (try printing it to make sure). If it is, you may need to change the User-Agent header of your request.
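
As a rough illustration of the URL check: lines read from a file keep their trailing newline, and printing with repr() makes that visible. The helper name clean_urls is made up for this sketch and is not part of any library.

```python
from urllib.parse import urlparse


def clean_urls(lines):
    """Strip whitespace from each line and keep only well-formed http(s) URLs."""
    urls = []
    for line in lines:
        url = line.strip()  # lines read from a file end with '\n'
        if not url:
            continue  # skip blank lines
        parts = urlparse(url)
        if parts.scheme in ("http", "https") and parts.netloc:
            urls.append(url)
        else:
            print("Skipping malformed URL:", repr(url))
    return urls


sample = [
    "https://www.congress.gov/congressional-record/2003/3/12/house-section/article/h1752-1\n",
    "\n",
]
for url in clean_urls(sample):
    print(repr(url))  # repr() makes stray whitespace visible
```

If the printed URLs look right and the request still returns 403, the URL itself is probably not the problem.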

To change the header, I suggest using requests instead of urllib, because requests is easier to use. With requests, your code might look something like this:

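import requests

with open('/Users/myusername/Desktop/py_test/url_list.txt') as f:
    # Strip whitespace and skip blank lines (e.g. a trailing newline)
    url_list = [line.strip() for line in f if line.strip()]

for url in url_list:
    # Send a browser-like User-Agent header with the request
    r = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    r.raise_for_status()  # raise on 403 instead of saving an error page
    # Name the output file after the last URL segment, e.g. h1752-1
    with open('/Users/myusername/Desktop/py_test/' + url.split('/')[-1], 'w', encoding='utf-8') as f:
        f.write(r.text)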
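
If you would rather stay with urllib, the same header can be attached through urllib.request.Request. This is only a sketch: build_request and download are hypothetical helper names, the 30-second timeout is an arbitrary choice, and nothing here guarantees that the server accepts this User-Agent.

```python
import urllib.request
from urllib.error import HTTPError


def build_request(url):
    """Return a Request carrying a browser-like User-Agent header."""
    return urllib.request.Request(url.strip(), headers={"User-Agent": "Mozilla/5.0"})


def download(url, dest_path):
    """Fetch url and write the response body to dest_path; return True on success."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=30) as resp:
            body = resp.read()
    except HTTPError as err:
        # A 403 here means the server still refuses the request
        print(f"{url.strip()}: HTTP {err.code} {err.reason}")
        return False
    with open(dest_path, "wb") as out:
        out.write(body)
    return True
```

Calling download(url, path) for each line of url_list.txt would take the place of the urlretrieve call in the question.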

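If this doesn't work, you have probably been blocked from the site, and there is nothing you can do about it, because the block is applied on the server side rather than the client side.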
