Python error when webscraping HTTP Error 403: Forbidden

I'm a beginner at this and am trying to scrape pages from the Congressional Record. I have a .txt file (url_list.txt) listing the pages I'd like to download. The contents of the file look like this:

https://www.congress.gov/congressional-record/2003/3/12/house-section/article/h1752-1
https://www.congress.gov/congressional-record/2003/11/7/house-section/article/h10982-2
https://www.congress.gov/congressional-record/2003/1/29/house-section/article/h231-3

I'm using this code:

import urllib.request

with open('/Users/myusername/Desktop/py_test/url_list.txt') as f:
   for line in f:
      url = line
      path = '/Users/myusername/Desktop/py_test'+url.split('/', -1)[-1]
      urllib.request.urlretrieve(url, path.rstrip('\n'))

I get this error:

Traceback (most recent call last):
  File "/Users/myusername/Desktop/py_test/py_try.py", line 7, in <module>
    urllib.request.urlretrieve(url, path.rstrip('\n'))
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py", line 241, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py", line 216, in urlopen
    return opener.open(url, data, timeout)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py", line 525, in open
    response = meth(req, response)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py", line 634, in http_response
    response = self.parent.error(
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py", line 563, in error
    return self._call_chain(*args)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py", line 496, in _call_chain
    result = func(*args)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py", line 643, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden

Any help would be appreciated.

HTTP error 403 (Forbidden) means the server understood your request but refuses to give you access to the resource you asked for.

First check that the URL you are requesting is correct (try printing it to make sure). If it is, you may need to set the User-Agent header of the request: by default urllib identifies itself as Python-urllib/3.x, and some servers refuse requests from that client.
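For reference, urllib can send a custom User-Agent too if the URL is wrapped in a urllib.request.Request object. The following is only a minimal sketch: build_request and download are hypothetical helper names, and 'Mozilla/5.0' is just an example value that the server may or may not accept.

```python
import urllib.request


def build_request(url, user_agent="Mozilla/5.0"):
    """Return a Request for url that carries a custom User-Agent header."""
    # strip() drops the trailing newline left over from reading a text file
    return urllib.request.Request(url.strip(), headers={"User-Agent": user_agent})


def download(url, path):
    """Fetch url with the custom header and save the response body to path."""
    with urllib.request.urlopen(build_request(url)) as response:
        with open(path, "wb") as out:
            out.write(response.read())


# Example call (performs a real network request, so it is left commented out):
# download(url, path)
```

The header is attached to the Request object, so nothing else about the download loop has to change.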

To do this, I suggest using requests rather than urllib, as requests is much easier to use. With requests, your code might look something like this:

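import requests

base_dir = '/Users/myusername/Desktop/py_test/'

with open(base_dir + 'url_list.txt') as f:
    # Strip whitespace and skip blank lines (e.g. the one after a trailing newline)
    url_list = [line.strip() for line in f if line.strip()]

for url in url_list:
    # Send a browser-like User-Agent instead of the library default
    r = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    r.raise_for_status()  # raise an exception on 403 and other error responses
    # Name the output file after the last URL segment, e.g. h1752-1
    with open(base_dir + url.split('/')[-1], 'w', encoding='utf-8') as out:
        out.write(r.text)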

If that doesn't work, the server is probably blocking you deliberately, and there is not much you can do about it: the block is enforced on the server side, not the client side.
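To find out which URLs are actually refused, rather than stopping at the first traceback, the loop can catch the error for each URL. This is only a sketch under some assumptions: download_all, output_path and the fetch parameter are hypothetical names, and fetch stands for any function that returns the page body as bytes or raises urllib.error.HTTPError. It also builds the output path with pathlib, which avoids the missing '/' between 'py_test' and the file name in the question's code.

```python
from pathlib import Path
from urllib.error import HTTPError
from urllib.parse import urlparse


def output_path(url, out_dir):
    """Name the output file after the last path segment of the URL."""
    name = urlparse(url.strip()).path.rstrip("/").split("/")[-1]
    return Path(out_dir) / name


def download_all(urls, out_dir, fetch):
    """Save each URL under out_dir; return the URLs that answered 403."""
    blocked = []
    for raw in urls:
        url = raw.strip()
        if not url:  # skip blank lines such as the one after a trailing newline
            continue
        try:
            body = fetch(url)
        except HTTPError as err:
            if err.code != 403:
                raise  # other HTTP errors are not handled in this sketch
            blocked.append(url)
            continue
        output_path(url, out_dir).write_bytes(body)
    return blocked
```

Any fetch function that raises urllib.error.HTTPError on failure, such as one built on urllib.request.urlopen, can be passed in. If every URL comes back in the returned list, the block most likely applies to the whole site, not to individual pages.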
