I'm a beginner at this and am trying to webscrape from the Congressional record. I have a .txt file (url_list.txt) with websites I'd like to download. The .txt file data look like this:
https://www.congress.gov/congressional-record/2003/3/12/house-section/article/h1752-1
https://www.congress.gov/congressional-record/2003/11/7/house-section/article/h10982-2
https://www.congress.gov/congressional-record/2003/1/29/house-section/article/h231-3
I'm using this code:
import urllib.request

with open('/Users/myusername/Desktop/py_test/url_list.txt') as f:
    for line in f:
        url = line
        path = '/Users/myusername/Desktop/py_test'+url.split('/', -1)[-1]
        urllib.request.urlretrieve(url, path.rstrip('\n'))
I get this error:
Traceback (most recent call last):
  File "/Users/myusername/Desktop/py_test/py_try.py", line 7, in <module>
    urllib.request.urlretrieve(url, path.rstrip('\n'))
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py", line 241, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py", line 216, in urlopen
    return opener.open(url, data, timeout)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py", line 525, in open
    response = meth(req, response)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py", line 634, in http_response
    response = self.parent.error(
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py", line 563, in error
    return self._call_chain(*args)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py", line 496, in _call_chain
    result = func(*args)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py", line 643, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
Any help would be appreciated.
HTTP error 403 (Forbidden) means the server understood your request but refused to serve the resource you asked for.
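To find out which URL in the list is being refused, rather than stopping at the first traceback, the error can be caught and reported. The following is only a minimal sketch using the standard library; the helper names `describe_http_error` and `try_fetch` are made up for illustration and are not part of urllib.

```python
import urllib.error
import urllib.request


def describe_http_error(url, err):
    """Build a one-line summary of an HTTPError (hypothetical helper)."""
    # err.code is the numeric status (e.g. 403), err.reason the short message
    return f"HTTP {err.code} {err.reason}: {url}"


def try_fetch(url):
    """Return the response body as bytes, or None if the server refuses."""
    try:
        with urllib.request.urlopen(url) as resp:
            return resp.read()
    except urllib.error.HTTPError as err:
        # Report the failing URL and keep going instead of crashing
        print(describe_http_error(url, err))
        return None
```

Calling `try_fetch` on each line of url_list.txt would show whether every URL is refused or only some of them.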
Check that the URL you are requesting is correct (try printing it to make sure). If it is, you may need to set the User-Agent header of the request, since some servers refuse requests that do not look like they come from a browser.
To do this, I suggest using requests rather than urllib, as requests is much easier to use. Using requests, your code might look something like this:
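import requests

# Read the URLs, skipping blank lines (such as a trailing newline)
with open('/Users/myusername/Desktop/py_test/url_list.txt') as f:
    url_list = [line.strip() for line in f if line.strip()]

for url in url_list:
    # Send a browser-like User-Agent instead of the library default
    r = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    # Name each file after the last URL segment, e.g. h1752-1
    with open('/Users/myusername/Desktop/py_test/' + url.split('/')[-1], 'w') as out:
        out.write(r.text)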
If that doesn't work, the server is probably blocking you in some other way, and there is not much you can do about it from your side, since the decision is made on the server and not on the client.
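If you would rather stay with urllib, as in the question, a User-Agent header can also be attached through `urllib.request.Request`. This is a rough sketch under a few assumptions: the helper names (`build_request`, `target_path`, `download_all`) are invented for illustration, the 'Mozilla/5.0' value is just an example, and a browser-like header will not help if the server blocks requests on some other basis.

```python
import os
import urllib.request


def build_request(url, user_agent="Mozilla/5.0"):
    """Return a Request for url carrying the given User-Agent header."""
    # strip() removes the trailing newline left over from reading the file
    return urllib.request.Request(url.strip(), headers={"User-Agent": user_agent})


def target_path(url, out_dir):
    """Name the output file after the last segment of the URL."""
    return os.path.join(out_dir, url.strip().split("/")[-1])


def download_all(list_path, out_dir):
    """Fetch every non-blank URL listed in list_path and save it in out_dir."""
    with open(list_path) as f:
        urls = [line.strip() for line in f if line.strip()]
    for url in urls:
        with urllib.request.urlopen(build_request(url)) as resp:
            data = resp.read()
        # Write the raw bytes so no text decoding is needed
        with open(target_path(url, out_dir), "wb") as out:
            out.write(data)
```

With the paths from the question, this would be called as `download_all('/Users/myusername/Desktop/py_test/url_list.txt', '/Users/myusername/Desktop/py_test')`. Using `os.path.join` also supplies the path separator between the folder and the file name.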