Python scraping: web page does not exist, but the website redirects to another page

I am trying to find a way to tell whether a web page exists. There are plenty of approaches, such as httplib2, urlparse, and requests. In my case, however, the website redirects me to the home page when the requested page does not exist, e.g. https://www.thenews.com.pk/latest/category/sports/2015-09-21

Is there any way to detect that?

You can check whether the final URL differs from the one you requested (here it is the home page you get redirected to), and whether the response has any history of redirects.

>>> import requests
>>> target_url = "https://www.thenews.com.pk/latest/category/sports/2015-09-21"
>>> response = requests.get(target_url)
>>> response.history[0].url
u'https://www.thenews.com.pk/latest/category/sports/2015-09-21'
>>> response.url
u'https://www.thenews.com.pk/'
>>> response.history and response.url == 'https://www.thenews.com.pk/' != target_url
True
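
The same check can be wrapped in a small helper. This is only a sketch: the function names `redirected_to_other_page` and `page_seems_missing` are made up for illustration, and comparing only the URL paths (ignoring scheme, host, and trailing slashes) is an assumption that may not suit every site.

```python
from urllib.parse import urlsplit


def redirected_to_other_page(response, target_url):
    """Return True if the request was redirected and ended on a different path.

    `response` is expected to look like a requests.Response, i.e. to expose
    `.history` (the intermediate redirect responses) and `.url` (final URL).
    """
    if not response.history:
        # No redirect happened at all.
        return False
    # Assumption: compare paths only, without trailing slashes, so that a
    # benign redirect such as "/page" -> "/page/" or http -> https is not
    # reported as a missing page.
    wanted = urlsplit(target_url).path.rstrip("/")
    final = urlsplit(response.url).path.rstrip("/")
    return wanted != final


def page_seems_missing(target_url, timeout=10):
    """Fetch target_url with requests and apply the check above."""
    import requests  # imported here so the helper above works without it

    response = requests.get(target_url, timeout=timeout)
    return redirected_to_other_page(response, target_url)
```

Calling `page_seems_missing(target_url)` with the URL from the question would be expected to report a redirect to the home page, provided the site still behaves as shown in the session above.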

The URL you mention returns a redirect status code (307), which you can catch. See here:

$ curl -i https://www.thenews.com.pk/latest/category/sports/2015-09-21
HTTP/1.1 307 Temporary Redirect
Date: Sun, 26 Mar 2017 10:13:39 GMT
Content-Type: text/html; charset=UTF-8
Transfer-Encoding: chunked
Connection: keep-alive
Set-Cookie: __cfduid=ddcd246615efb68a7c72c73f480ea81971490523219; expires=Mon, 26-Mar-18 10:13:39 GMT; path=/; domain=.thenews.com.pk; HttpOnly
Set-Cookie: bf_session=b02fb5b6cc732dc6c3b60332288d0f1d4f9f7360; expires=Sun, 26-Mar-2017 11:13:39 GMT; Max-Age=3600; path=/; HttpOnly
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Location: https://www.thenews.com.pk/
X-Cacheable: YES
X-Varnish: 654909723
Age: 0
Via: 1.1 varnish
X-Age: 0
X-Cache: MISS
Access-Control-Allow-Origin: *
Server: cloudflare-nginx
CF-RAY: 345956a8be8a7289-AMS
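
In Python, the redirect status can be observed in a similar way by asking requests not to follow redirects. The sketch below is one possible approach, not a definitive one: `classify_status` and `probe` are hypothetical helper names, and the set of status codes treated as redirects is an assumption chosen for this example.

```python
# Assumption: these are the status codes we treat as redirects.
REDIRECT_CODES = {301, 302, 303, 307, 308}


def classify_status(status_code, location=None):
    """Map a status code (and optional Location header) to a short label."""
    if status_code in REDIRECT_CODES:
        return "redirect to {}".format(location) if location else "redirect"
    if status_code == 404:
        return "missing"
    if 200 <= status_code < 300:
        return "ok"
    return "other"


def probe(url, timeout=10):
    """Request url without following redirects and classify the reply."""
    import requests  # imported here so classify_status works without it

    # allow_redirects=False makes requests return the 3xx response itself
    # instead of silently following the Location header.
    response = requests.get(url, allow_redirects=False, timeout=timeout)
    return classify_status(response.status_code,
                           response.headers.get("Location"))
```

For a reply like the one in the curl output above (status 307 with a `Location` header pointing at the home page), `classify_status` returns a "redirect to ..." label, which the caller can then treat as "page does not exist" for this particular site.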
