How to bypass Incapsula with Python
I use Scrapy and I am trying to scrape this site, which uses Incapsula:
<meta name="robots" content="noindex,nofollow">
<script src="/_Incapsula_Resource?SWJIYLWA=719d34d31c8e3a6e6fffd425f7e032f3">
</script>
I already asked a question about this issue two years ago, but that method (Incapsula-Cracker) no longer works.
I tried to understand how Incapsula works, and I tried the following to bypass it:
import re
from scrapy import Request

# The functions below are methods of the spider class.
def start_requests(self):
    yield Request('https://courses-en-ligne.carrefour.fr', cookies={'store': 92},
                  dont_filter=True, callback=self.init_shop)

def init_shop(self, response):
    result_content = response.text
    # The challenge page carries a hex-encoded script in `var b="..."`.
    RE_ENCODED_FUNCTION = re.compile(r'var b="(.*?)"', re.DOTALL)
    RE_INCAPSULA = re.compile(r'(_Incapsula_Resource\?SWHANEDL=.*?)"')
    INCAPSULA_URL = 'https://courses-en-ligne.carrefour.fr/%s'
    encoded_func = RE_ENCODED_FUNCTION.search(result_content).group(1)
    # Decode two hex digits at a time into one character.
    decoded_func = ''.join(chr(int(encoded_func[i:i+2], 16)) for i in range(0, len(encoded_func), 2))
    incapsula_params = RE_INCAPSULA.search(decoded_func).group(1)
    incap_url = INCAPSULA_URL % incapsula_params
    # No callback is given, so the response goes to parse().
    yield Request(incap_url)

def parse(self, response):
    print(response.text)
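The hex-decoding step in init_shop can be tried on its own, outside Scrapy. The following is a minimal sketch: the helper name decode_hex_payload and the sample page are hypothetical, built locally for illustration, and it assumes the page still embeds the script as `var b="..."`, which may no longer hold.

```python
import re

# Same pattern as in the spider: the obfuscated script sits in `var b="..."`.
RE_ENCODED_FUNCTION = re.compile(r'var b="(.*?)"', re.DOTALL)


def decode_hex_payload(page_source):
    """Return the decoded text of the hex string assigned to `var b`, or None."""
    match = RE_ENCODED_FUNCTION.search(page_source)
    if match is None:
        # The page has no `var b="..."`; the spider would crash here on .group(1).
        return None
    encoded = match.group(1)
    # Each pair of hex digits is one character code.
    return ''.join(chr(int(encoded[i:i + 2], 16)) for i in range(0, len(encoded), 2))


# Hypothetical stand-in for a challenge page (not real Incapsula output).
sample_script = 'load("/_Incapsula_Resource?SWHANEDL=123");'
sample_page = '<script>var b="%s";</script>' % sample_script.encode('ascii').hex()
decoded = decode_hex_payload(sample_page)
```

Returning None when the pattern is missing makes it easier to tell "the page layout changed" apart from "the decoding is wrong".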
However, with the spider above I am redirected to a reCAPTCHA page:
<html style="height:100%">
<head>
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
<meta name="format-detection" content="telephone=no">
<meta name="viewport" content="initial-scale=1.0">
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
</head>
<body style="margin:0px;height:100%">
<iframe src="/_Incapsula_Resource?CWUDNSAI=27&xinfo=3-10784678-0%200NNN%20RT%281523525225370%20394%29%20q%280%20-1%20-1%20-1%29%20r%280%20-1%29%20B12%284%2c316%2c0%29%20U10000&incident_id=459000960022408474-41333502566401539&edet=12&cinfo=04000000" frameborder=0 width="100%" height="100%" marginheight="0px" marginwidth="0px">Request unsuccessful. Incapsula incident ID: 459000960022408474-41333502566401539
</iframe>
</body>
</html>
First of all, there is no foolproof solution to such problems. As an actual user, I end up having to solve captchas while answering on StackOverflow, which means a bot will definitely get captchas.
Now, there are a few rules that I try to follow to decrease the chances of a captcha:
TOR is a big NO.
Chrome + Selenium + Proxy + existing profile.
I prefer profiles that have browsing history across different websites and cookies from many other sites and trackers, going back months. This is a cat-and-mouse game in which you don't know what defenses the other party has, so you try to play nice and easy.
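As a rough illustration of the "Chrome + Selenium + Proxy + existing profile" rule, the sketch below builds the two Chrome command-line switches involved. The helper name build_chrome_args, the profile path, and the proxy address are hypothetical placeholders, and nothing here guarantees that a captcha is avoided.

```python
def build_chrome_args(profile_dir, proxy_url):
    """Build Chrome switches for reusing a profile and routing through a proxy."""
    return [
        "--user-data-dir=%s" % profile_dir,  # existing profile with history and cookies
        "--proxy-server=%s" % proxy_url,     # send browser traffic through this proxy
    ]


# Hypothetical values; substitute your own profile directory and proxy.
chrome_args = build_chrome_args("/path/to/chrome-profile", "http://127.0.0.1:8080")
```

With Selenium installed, each switch can be passed through `options = webdriver.ChromeOptions()` and `options.add_argument(arg)`, followed by `webdriver.Chrome(options=options)`.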
This is not the best answer; it just gives some points to understand why web scraping is not that easy, especially when there is a CDN in front.
First, it may be good to check what you will be fighting against: WAF & Bot Mitigation.
Then, for more ideas, this is a good talk: How Attackers Circumvent CDNs to Attack Origin.
Now, this doesn't mean that web scraping is impossible. The problem reduces to time/speed: the faster you try something, the higher the chances that you trigger captchas and, in the worst case, get fully blocked.
There are multiple approaches, such as using a different IP per request (see: Make requests using Python over Tor), changing the user agent, etc. But most of them are bound to a set of defined timeouts and query patterns that you may need to find.
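Since the question uses Scrapy, the time/speed point can be expressed with Scrapy's built-in throttling settings. This is a minimal sketch: the setting names are standard Scrapy settings, but the dictionary name THROTTLE_SETTINGS and every value are assumptions chosen for illustration, not thresholds known to work against any particular site.

```python
# Illustrative values only; the real limits have to be found by experiment.
THROTTLE_SETTINGS = {
    # Base wait (seconds) between requests to the same site.
    "DOWNLOAD_DELAY": 5,
    # Wait a random 0.5x to 1.5x of DOWNLOAD_DELAY instead of a fixed interval.
    "RANDOMIZE_DOWNLOAD_DELAY": True,
    # Never hit the same domain with parallel requests.
    "CONCURRENT_REQUESTS_PER_DOMAIN": 1,
    # Let the AutoThrottle extension adapt the delay to observed latency.
    "AUTOTHROTTLE_ENABLED": True,
    "AUTOTHROTTLE_START_DELAY": 5,
    "AUTOTHROTTLE_MAX_DELAY": 60,
    "AUTOTHROTTLE_TARGET_CONCURRENCY": 1.0,
}
```

The dictionary can be assigned to the spider's `custom_settings` class attribute, or the same keys can be placed in the project's settings.py.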