I'm trying to retrieve the HTTP status code of a list of URLs in python, using the following piece of code:
import requests

try:
    r = requests.head(testpoint_url)
    print(testpoint_url + " : " + str(r.status_code))
    # prints the int of the status code.
except requests.ConnectionError:
    print("failed to connect")
Surprisingly, for some URLs I get a 302 status code, while a browser shows a 404 for the same URL!
What is going on? How can I get the real status code (e.g. 404)?
302 is an HTTP redirection. A web browser will follow the redirect to the URL reported in the Location response header. The request for that next URL gets its own response code, which can be 404.
Your Python code does not follow the redirect, which is why it gets the original 302 instead.
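To see where such a 302 points before following it, you can inspect the Location header on the response. Below is a minimal sketch; redirect_target is a hypothetical helper name, not part of Requests, and it only assumes the object has status_code and headers like a requests.Response does.

```python
def redirect_target(response):
    """Return the Location header of a 3xx response, or None otherwise.

    `response` is assumed to look like a requests.Response:
    it needs a `status_code` attribute and a `headers` mapping.
    """
    if 300 <= response.status_code < 400:
        return response.headers.get("Location")
    return None


# Usage (needs the third-party `requests` package and network access):
#   import requests
#   r = requests.head(testpoint_url)  # HEAD does not follow redirects by default
#   print(r.status_code, redirect_target(r))
```

If this prints 302 together with a URL, that URL is the one the browser requested next and that answered with the 404.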
Per the Requests documentation:
By default Requests will perform location redirection for all verbs except HEAD.
We can use the history property of the Response object to track redirection.
The Response.history list contains the Response objects that were created in order to complete the request. The list is sorted from the oldest to the most recent response.
...
If you're using GET, OPTIONS, POST, PUT, PATCH or DELETE, you can disable redirection handling with the allow_redirects parameter:
>>> r = requests.get('http://github.com/', allow_redirects=False)
>>> r.status_code
301
>>> r.history
[]
If you're using HEAD, you can enable redirection as well:
>>> r = requests.head('http://github.com/', allow_redirects=True)
>>> r.url
'https://github.com/'
>>> r.history
[<Response [301]>]
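As a rough illustration of the history list described in the quoted documentation, the sketch below flattens a response and its history into a list of hops. redirect_chain is a hypothetical helper name; it relies only on the status_code, url and history attributes that requests.Response provides.

```python
def redirect_chain(response):
    """Return [(status_code, url), ...] for every hop, oldest first.

    The entries from `response.history` come first (the redirects that
    were followed); the last entry is the final response itself.
    """
    hops = [(hop.status_code, hop.url) for hop in response.history]
    hops.append((response.status_code, response.url))
    return hops


# Usage (needs the third-party `requests` package and network access):
#   import requests
#   r = requests.head(testpoint_url, allow_redirects=True)
#   for status, url in redirect_chain(r):
#       print(status, url)
```

For a URL that redirects once and then fails, the list would hold the 302 hop first and the 404 response last.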
So, in your code, change this:
r = requests.head(testpoint_url)
To this:
r = requests.head(testpoint_url, allow_redirects=True)
Then r.status_code will be the final status code (i.e. 404) after all redirects have been followed.
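Since the question is about a list of URLs, here is one possible way to wrap the change in a loop. This is a sketch under assumptions, not a definitive implementation: check_urls, the session argument and the timeout value are illustrative choices, and failures are caught as OSError because requests.RequestException (and therefore requests.ConnectionError) derives from it.

```python
def check_urls(session, urls, timeout=10):
    """Map each URL to its final status code, or None if the request failed.

    `session` is assumed to behave like a requests.Session, i.e. to offer
    head(url, allow_redirects=..., timeout=...).
    """
    results = {}
    for url in urls:
        try:
            # allow_redirects=True makes HEAD follow 301/302 to the final URL.
            response = session.head(url, allow_redirects=True, timeout=timeout)
        except OSError:
            # requests.RequestException subclasses IOError, an alias of OSError,
            # so connection failures and timeouts from Requests land here.
            results[url] = None
        else:
            results[url] = response.status_code
    return results


# Usage (needs the third-party `requests` package and network access):
#   import requests
#   with requests.Session() as session:
#       for url, status in check_urls(session, list_of_urls).items():
#           print(url + " : " + str(status))
```

Passing the session in keeps connections reusable across the list; list_of_urls above is a placeholder for your own URLs.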