How to bypass Incapsula with Python

I use Scrapy and I am trying to scrape this site, which uses Incapsula:

<meta name="robots" content="noindex,nofollow">
<script src="/_Incapsula_Resource?SWJIYLWA=719d34d31c8e3a6e6fffd425f7e032f3">
</script>

I already asked a question about this issue 2 years ago, but that method (Incapsula-Cracker) does not work anymore.

I tried to understand how Incapsula works, and I tried this to bypass it:

import re
from scrapy import Request

# Methods of a scrapy.Spider subclass (the class definition is omitted here).
def start_requests(self):
    yield Request('https://courses-en-ligne.carrefour.fr', cookies={'store': 92},
                  dont_filter=True, callback=self.init_shop)

def init_shop(self, response):
    result_content      = response.text  # decoded text, so the str patterns below can match
    RE_ENCODED_FUNCTION = re.compile(r'var b="(.*?)"', re.DOTALL)
    RE_INCAPSULA        = re.compile(r'(_Incapsula_Resource\?SWHANEDL=.*?)"')
    INCAPSULA_URL       = 'https://courses-en-ligne.carrefour.fr/%s'
    encoded_func        = RE_ENCODED_FUNCTION.search(result_content).group(1)
    # The script body is hex-encoded: two hex digits per character.
    decoded_func        = ''.join(chr(int(encoded_func[i:i+2], 16)) for i in range(0, len(encoded_func), 2))
    incapsula_params    = RE_INCAPSULA.search(decoded_func).group(1)
    incap_url           = INCAPSULA_URL % incapsula_params
    yield Request(incap_url, callback=self.parse)

def parse(self, response):
    print(response.text)

But I'm redirected to a reCAPTCHA page:

<html style="height:100%">
<head>
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
<meta name="format-detection" content="telephone=no">
<meta name="viewport" content="initial-scale=1.0">
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
</head>
<body style="margin:0px;height:100%">
<iframe src="/_Incapsula_Resource?CWUDNSAI=27&xinfo=3-10784678-0%200NNN%20RT%281523525225370%20394%29%20q%280%20-1%20-1%20-1%29%20r%280%20-1%29%20B12%284%2c316%2c0%29%20U10000&incident_id=459000960022408474-41333502566401539&edet=12&cinfo=04000000" frameborder=0 width="100%" height="100%" marginheight="0px" marginwidth="0px">Request unsuccessful. Incapsula incident ID: 459000960022408474-41333502566401539
</iframe>
</body>
</html>
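
A scraper can at least recognize this page instead of treating it as real content. The following is a minimal sketch, assuming the block page always carries the "Incapsula incident ID" text seen in the response above; the function name `incapsula_incident_id` is made up for illustration.

```python
import re

# Pattern based on the block page shown above:
# "Incapsula incident ID: <digits>-<digits>"
INCIDENT_RE = re.compile(r'Incapsula incident ID: (\d+-\d+)')


def incapsula_incident_id(html):
    """Return the incident ID if html looks like an Incapsula block page, else None."""
    match = INCIDENT_RE.search(html)
    return match.group(1) if match else None
```

In a Scrapy callback this could be called as `incapsula_incident_id(response.text)`, so the spider can log the incident and slow down rather than parse the page.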

So first of all, there is no foolproof solution to such problems. Even as an actual user, I end up having to solve captchas while answering on StackOverflow, which means a bot will definitely get captchas.

Now, there are a few rules which I try to follow to decrease the chances of a captcha:

  • Never use shared proxies for such projects. Using TOR is a big NO.
  • Use Chrome + Selenium + Proxy.
  • Use Chrome with an existing profile. I prefer profiles that have browsing history from different websites, plus cookies from many other sites and trackers, going back months. You don't know how the user/bot evaluation happens, so you want to look as much like a real user as possible.
  • Never scrape at fast rates; use delays that are as long and as random as possible.
  • Always use a visible browser and keep monitoring for captchas. When a captcha appears, solve it manually or use DeathByCaptcha or a similar service. Try not to abort captcha pages, as that may raise your bot-probability score.
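
The "random delays" rule can be sketched roughly like this. It is only an illustration: the names `random_delay` and `polite_pause` and the 5 to 30 second bounds are my own assumptions, not values known to satisfy any particular site.

```python
import random
import time


def random_delay(min_seconds=5.0, max_seconds=30.0):
    """Pick a delay uniformly at random between the two bounds (assumed values)."""
    return random.uniform(min_seconds, max_seconds)


def polite_pause(min_seconds=5.0, max_seconds=30.0, sleep=time.sleep):
    """Sleep for a random delay and return how long the pause was.

    The sleep function is injectable so the helper can be tested without waiting.
    """
    delay = random_delay(min_seconds, max_seconds)
    sleep(delay)
    return delay
```

Calling `polite_pause()` between page loads keeps the request rate low and irregular. In Scrapy, the built-in `DOWNLOAD_DELAY` and `RANDOMIZE_DOWNLOAD_DELAY` settings serve a similar purpose.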

This is a cat-and-mouse game where you don't know what defenses the other party has, so you try to play nice and easy.

This is not the best answer, just some points to help understand why web scraping is not that easy, mainly when there is a CDN in front.

First, it may be good to check what you will be fighting against: WAF & Bot Mitigation.

Then, to get more ideas, this is a good talk: How Attackers Circumvent CDNs to Attack Origin.

Now, this doesn't mean web scraping is impossible. The problem reduces to time/speed: the faster you go, the higher the chances that you trigger captchas and, in the worst case, get fully blocked.
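
One way to act on that trade-off is to slow down further every time a captcha or block shows up. Here is a rough sketch of an exponential backoff; the function name `backoff_delay`, the 10 second base, and the 15 minute cap are assumptions for illustration only.

```python
def backoff_delay(consecutive_blocks, base_seconds=10.0, cap_seconds=900.0):
    """Return how long to wait before the next request.

    The wait doubles with each consecutive captcha/block and is capped,
    so the scraper keeps slowing down instead of hammering the site.
    """
    if consecutive_blocks <= 0:
        return base_seconds
    return min(cap_seconds, base_seconds * 2 ** consecutive_blocks)
```

The counter would be reset to 0 after a normal page comes back, so the delay only grows while the site keeps pushing back.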

There are multiple approaches, such as using a different IP per request (Make requests using Python over Tor), changing the user agent, etc. But most of them are bound to a set of defined timeouts and query patterns that you may need to find.
