Link with status code 200 redirects

I have a link that returns status code 200, but when I open it in a browser, it redirects.

Fetching the same link with Python Requests simply returns the data from the original link. I tried both Requests and urllib, with no success.

  1. How can I capture the final URL and its data?

  2. How can a link with status 200 redirect?

>>> import requests
>>> url = 'http://www.afaqs.com/news/story/52344_The-target-is-to-get-advertisers-to-switch-from-print-to-TV-Ravish-Kumar-Viacom18'
>>> r = requests.get(url)
>>> r.url
'http://www.afaqs.com/news/story/52344_The-target-is-to-get-advertisers-to-switch-from-print-to-TV-Ravish-Kumar-Viacom18'
>>> r.history
[]
>>> r.status_code
200

This is the link

Redirected link

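This redirect happens on the client side: the server really does answer with status 200, and the browser then navigates away by itself. Here it is done with an HTML meta refresh tag (JavaScript can produce the same effect). Because Requests only follows HTTP redirects (3xx responses), requests.get(...) will not take you to the redirected link directly. The original URL has the following page source: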

<html>
    <head>
        <meta http-equiv="refresh" content="0;URL=http://www.afaqs.com/interviews/index.html?id=572_The-target-is-to-get-advertisers-to-switch-from-print-to-TV-Ravish-Kumar-Viacom18">
        <script type="text/javascript" src="http://gc.kis.v2.scr.kaspersky-labs.com/D5838D60-3633-1046-AA3A-D5DDF145A207/main.js" charset="UTF-8"></script>
    </head>
    <body bgcolor="#FFFFFF"></body>
</html>

Here, you can see the redirected URL in the content attribute of the meta tag. Your job is to scrape it. You can do that with a regex, or simply with some string split operations.

For example:

import requests

r = requests.get('http://www.afaqs.com/news/story/52344_The-target-is-to-get-advertisers-to-switch-from-print-to-TV-Ravish-Kumar-Viacom18')
# Take the text between 'URL=' and the closing '">' of the meta tag
redirected_url = r.text.split('URL=')[1].split('">')[0]
print(redirected_url)
# http://www.afaqs.com/interviews/index.html?id=572_The-target-is-to-get-advertisers-to-switch-from-print-to-TV-Ravish-Kumar-Viacom18

r = requests.get(redirected_url)
# Start scraping from this link...

Or, using a regex:

import re

# Non-greedy match, so the capture stops at the first '">' after 'URL='
redirected_url = re.findall(r'URL=(http.*?)">', r.text)[0]
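The one-line patterns above rely on the exact casing and quoting used in this page's source. A slightly more tolerant version might look like the sketch below. The function name extract_meta_refresh_url is made up for illustration, and the pattern assumes that http-equiv appears before content inside the tag, as it does here. It also resolves relative targets against the page URL with urljoin.

```python
import re
from urllib.parse import urljoin

# Matches the URL part of a meta refresh tag such as
# <meta http-equiv="refresh" content="0;URL=http://example.com/next">.
# Assumption: http-equiv comes before content inside the tag.
_REFRESH_RE = re.compile(
    r'<meta[^>]+http-equiv=["\']?refresh["\']?[^>]*'
    r'content=["\']?\s*\d+\s*;\s*url=([^"\'>\s]+)',
    re.IGNORECASE,
)


def extract_meta_refresh_url(html, base_url):
    """Return the absolute meta refresh target found in html, or None."""
    match = _REFRESH_RE.search(html)
    if match is None:
        return None
    # urljoin leaves absolute targets untouched and resolves relative ones
    return urljoin(base_url, match.group(1))
```

With Requests, this could be called as extract_meta_refresh_url(r.text, r.url); a None result means no meta refresh was found in the page.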

URLs of this kind sit inside the page's markup (a meta tag here, or JavaScript code in a script tag) and are acted on by the browser. Therefore Python does not follow them.

To get the link, simply extract it from the tag that contains it.
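As a rough illustration of extracting the link from its tag without a regex, the sketch below uses the standard library's html.parser and then keeps following meta refreshes until none is left. The names MetaRefreshParser, find_meta_refresh and resolve_final_url are hypothetical, and fetch stands for any callable that takes a URL and returns the page's HTML text. Redirects performed by JavaScript are not handled; only the meta refresh case from this question is.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class MetaRefreshParser(HTMLParser):
    """Records the target of the first <meta http-equiv="refresh"> tag."""

    def __init__(self):
        super().__init__()
        self.target = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.target is not None:
            return
        attributes = dict(attrs)
        if (attributes.get("http-equiv") or "").lower() != "refresh":
            return
        # content looks like "0;URL=http://example.com/next"
        _, _, rest = (attributes.get("content") or "").partition(";")
        key, _, value = rest.strip().partition("=")
        if key.strip().lower() == "url" and value.strip():
            self.target = value.strip().strip("'\"")


def find_meta_refresh(html, base_url):
    """Return the absolute meta refresh target in html, or None."""
    parser = MetaRefreshParser()
    parser.feed(html)
    if parser.target is None:
        return None
    return urljoin(base_url, parser.target)


def resolve_final_url(url, fetch, max_hops=5):
    """Follow meta refreshes from url; return (final_url, final_html)."""
    html = fetch(url)
    for _ in range(max_hops):  # hop limit guards against refresh loops
        target = find_meta_refresh(html, url)
        if target is None or target == url:
            break
        url = target
        html = fetch(url)
    return url, html
```

With Requests, fetch could be something like lambda u: requests.get(u, timeout=10).text, which gives both the final URL and its data in one call.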
