
Get final URL after timed delay or redirect

I am trying to scrape a website, but the page has a 5-second redirect delay: you have to wait 5 seconds before the real page loads. I have tried the following code.

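from bs4 import BeautifulSoup
import time
import requests

r = requests.get("https://etherscan.io/address/0xc257274276a4e539741ca11b590b9447b26a8051", timeout=6)
time.sleep(5)  # attempt to wait out the 5-second delay
print(r.history)  # redirect responses that requests followed, if any

data = r.text

# name the parser explicitly to avoid bs4's "no parser specified" warning
soup = BeautifulSoup(data, "html.parser")

print(soup.prettify())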

But when I run the code I get the redirect page, not the final page. Thanks for any help.

It looks like etherscan.io is protected by Cloudflare, and Cloudflare is causing the delayed redirect that you are seeing. One of the purposes of Cloudflare is to prevent bots from making automated requests to the site (which seems a lot like what you are doing).

Getting around Cloudflare will not be easy. First, you'll need to make your requests 'look like' they are coming from a real browser, meaning that the tool you use to make these requests needs to present the same request headers that a real browser would, handle cookies like a browser would, run JavaScript like a browser would, etc.
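To illustrate just the headers-and-cookies part, here is a minimal sketch using `requests.Session`. The header values and the helper name `make_browser_like_session` are my own example choices, not anything Cloudflare documents, and since `requests` does not run JavaScript this alone should not be expected to get past the delayed redirect.

```python
import requests

# Example values only (an assumption): copy the real ones from your own
# browser's developer tools if you want them to match a current browser.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/120.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}


def make_browser_like_session():
    """Return a Session that sends browser-style headers on every request.

    A Session also stores cookies set by responses and sends them back
    on later requests, similar to what a browser does.
    """
    session = requests.Session()
    session.headers.update(BROWSER_HEADERS)
    return session
```

You would then call `make_browser_like_session().get(url, timeout=6)` in place of `requests.get(...)`.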

Even if you succeed in doing all of the above, Cloudflare is likely to block your requests (or challenge them) after a certain number of requests have been made over some period of time.
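If you do make repeated requests, spacing them out is one way to stay under such limits. The sketch below is a simple client-side throttle; the class name `Throttle` and the interval are hypothetical, and the actual thresholds Cloudflare applies are not known to me.

```python
import time


class Throttle:
    """Enforce a minimum number of seconds between successive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock   # injectable so the class can be tested without real delays
        self._sleep = sleep
        self._last = None     # time of the previous call, None before the first one

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Call `throttle.wait()` immediately before each request, for example with `throttle = Throttle(10.0)` to leave at least 10 seconds between requests.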

If you are really set on using something other than Selenium or the API (which would make the most sense), you could take a look at this. It's a scraper meant to handle Cloudflare sites, but it requires some other things (most notably Node.js) to run. While this is pretty neat, it seems like a pain when there are easier solutions.
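For comparison, the Selenium route mentioned above could look roughly like this. It is a sketch under assumptions: `wait_for_final_page`, `run_with_selenium`, and the `INTERSTITIAL_MARKER` text are names and values I made up for illustration, so check what the waiting page actually contains and adjust the condition. The helper only needs an object with `get`, `current_url`, and `page_source`, which a Selenium WebDriver provides.

```python
import time

URL = "https://etherscan.io/address/0xc257274276a4e539741ca11b590b9447b26a8051"

# Assumption: some text that appears only on the waiting page. Replace it
# with whatever the interstitial really shows when you load it yourself.
INTERSTITIAL_MARKER = "Checking your browser"


def wait_for_final_page(driver, url, is_final, timeout=30.0, poll=0.5):
    """Load url, then poll until is_final(driver) is true.

    Returns (current_url, page_source) of the page the browser ended up on,
    or raises TimeoutError if the condition never becomes true.
    """
    driver.get(url)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if is_final(driver):
            return driver.current_url, driver.page_source
        time.sleep(poll)
    raise TimeoutError(f"final page did not load within {timeout} seconds")


def run_with_selenium():
    # Imported here so the helper above can be used without Selenium installed.
    from selenium import webdriver

    driver = webdriver.Chrome()  # needs a local Chrome installation
    try:
        final_url, html = wait_for_final_page(
            driver, URL, lambda d: INTERSTITIAL_MARKER not in d.page_source
        )
        print(final_url)
        return html
    finally:
        driver.quit()
```

Calling `run_with_selenium()` opens a real browser, so the JavaScript on the waiting page runs just as it would for a human visitor; the returned HTML can then be passed to `BeautifulSoup` as in the question.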

