
How to bypass Incapsula with Python

I use Scrapy and I am trying to scrape a site that uses Incapsula:

<meta name="robots" content="noindex,nofollow">
<script src="/_Incapsula_Resource?SWJIYLWA=719d34d31c8e3a6e6fffd425f7e032f3">
</script>

I already asked a question about this issue two years ago, but the method suggested there (Incapsula-Cracker) no longer works.

I tried to understand how Incapsula works, and I tried the following to bypass it:

import re
from scrapy import Request

def start_requests(self):
    yield Request('https://courses-en-ligne.carrefour.fr', cookies={'store': 92}, dont_filter=True, callback=self.init_shop)

def init_shop(self, response):
    # response.text is the decoded body, so the str patterns below can search it
    result_content      = response.text
    RE_ENCODED_FUNCTION = re.compile(r'var b="(.*?)"', re.DOTALL)
    RE_INCAPSULA        = re.compile(r'(_Incapsula_Resource\?SWHANEDL=.*?)"')
    INCAPSULA_URL       = 'https://courses-en-ligne.carrefour.fr/%s'
    encoded_func        = RE_ENCODED_FUNCTION.search(result_content).group(1)
    # Each pair of hex digits encodes one character of the obfuscated script
    decoded_func        = ''.join(chr(int(encoded_func[i:i+2], 16)) for i in range(0, len(encoded_func), 2))
    incapsula_params    = RE_INCAPSULA.search(decoded_func).group(1)
    incap_url           = INCAPSULA_URL % incapsula_params
    # No callback given, so Scrapy passes the response to parse()
    yield Request(incap_url)

def parse(self, response):
    print(response.text)
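The decoding step in init_shop can be tried in isolation. The sketch below assumes, as the code above does, that the challenge script embeds a hex-encoded string in var b="..."; the decode_challenge helper and the sample payload are made up for illustration and were not captured from the real site.

```python
import re

RE_ENCODED_FUNCTION = re.compile(r'var b="(.*?)"', re.DOTALL)
RE_INCAPSULA = re.compile(r'(_Incapsula_Resource\?SWHANEDL=.*?)"')


def decode_challenge(page_source):
    """Return the hidden _Incapsula_Resource path, or None if it is not found."""
    encoded = RE_ENCODED_FUNCTION.search(page_source)
    if encoded is None:
        return None
    try:
        # Each pair of hex digits is one character of the hidden script.
        decoded = bytes.fromhex(encoded.group(1)).decode("latin-1")
    except ValueError:
        # The captured string was not valid hex, so the page layout has changed.
        return None
    resource = RE_INCAPSULA.search(decoded)
    return resource.group(1) if resource else None


# Hypothetical payload, hex-encoded the way the code above expects.
hidden_js = 'xhr.open("GET","/_Incapsula_Resource?SWHANEDL=12345",false);'
sample_page = '<script>var b="%s";</script>' % hidden_js.encode("latin-1").hex()
print(decode_challenge(sample_page))
```

Returning None instead of raising makes it easier to notice when the page no longer matches the expected layout, which is likely what happened when the older method stopped working.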

But I am redirected to a reCAPTCHA page:

<html style="height:100%">
<head>
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
<meta name="format-detection" content="telephone=no">
<meta name="viewport" content="initial-scale=1.0">
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
</head>
<body style="margin:0px;height:100%">
<iframe src="/_Incapsula_Resource?CWUDNSAI=27&xinfo=3-10784678-0%200NNN%20RT%281523525225370%20394%29%20q%280%20-1%20-1%20-1%29%20r%280%20-1%29%20B12%284%2c316%2c0%29%20U10000&incident_id=459000960022408474-41333502566401539&edet=12&cinfo=04000000" frameborder=0 width="100%" height="100%" marginheight="0px" marginwidth="0px">Request unsuccessful. Incapsula incident ID: 459000960022408474-41333502566401539
</iframe>
</body>
</html>
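A scraper can at least recognise this block page instead of treating it as content. The sketch below assumes the block page always carries the "Incapsula incident ID:" text seen in the response above; incapsula_incident_id is a hypothetical helper name, and the sample HTML is an abbreviated copy of that response.

```python
import re

RE_INCIDENT = re.compile(r"Incapsula incident ID: ([\d-]+)")


def incapsula_incident_id(html):
    """Return the incident ID if html looks like an Incapsula block page, else None."""
    if "_Incapsula_Resource" not in html:
        return None
    match = RE_INCIDENT.search(html)
    return match.group(1) if match else None


# Abbreviated version of the block page shown above.
blocked = (
    '<iframe src="/_Incapsula_Resource?CWUDNSAI=27&incident_id=459000960022408474-41333502566401539">'
    "Request unsuccessful. Incapsula incident ID: 459000960022408474-41333502566401539"
    "</iframe>"
)
print(incapsula_incident_id(blocked))
```

The first page in the question also references _Incapsula_Resource but has no incident ID, so the helper returns None for it and only flags the captcha page.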

So first of all, there is no foolproof solution to such problems. Even as a real user, I end up having to solve captchas while answering on StackOverflow, which means a bot will definitely get captchas.

Now, there are a few rules I try to follow to decrease the chances of a captcha:

  • Never use shared proxies for such projects. Using TOR is a big NO
  • Use Chrome + Selenium + Proxy
  • Use Chrome with an existing profile. I prefer profiles that have browsing history on different websites, plus cookies and trackers from many other sites, going back months. You don't know how a user is told apart from a bot, so you want to look as much like a real user as possible
  • Never scrape at fast rates; use as many delays as possible, and make them as random as possible
  • Always use a visible browser and keep watching for the captcha. When one appears, solve it manually or use DeathByCaptcha or a similar service. Try not to abandon captcha pages, as that may raise the probability that you are classified as a bot
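The browser setup from the list above could be sketched as the command-line switches handed to Chrome. --user-data-dir and --proxy-server are standard Chromium switches; the chrome_arguments helper, the profile path, and the proxy address are placeholders I made up for illustration.

```python
def chrome_arguments(profile_dir, proxy=None):
    """Build Chrome switches for an existing profile and an optional proxy."""
    # Reuse a profile that already has history and cookies.
    args = ["--user-data-dir=" + profile_dir]
    if proxy:
        # Route traffic through a dedicated (not shared) proxy.
        args.append("--proxy-server=" + proxy)
    # No headless switch is added, so the browser window stays visible.
    return args


args = chrome_arguments("/home/me/chrome-profile", "http://203.0.113.10:3128")
print(args)

# With Selenium installed, the switches could be applied like this:
#   from selenium import webdriver
#   options = webdriver.ChromeOptions()
#   for arg in args:
#       options.add_argument(arg)
#   driver = webdriver.Chrome(options=options)
```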

This is a cat-and-mouse game where you don't know what defenses the other party has, so you try to play nice and easy.

This is not the best answer; it just gives some points to help understand why web scraping is not that easy, especially when there is a CDN in front.

First, it may be good to check what you will be fighting against: WAF & Bot Mitigation.

Then to get more ideas, this is a good talk: How Attackers Circumvent CDNs to Attack Origin

Now, this doesn't mean web scraping is impossible. The problem reduces to time/speed: the faster you go, the higher the chances that you trigger captchas and, in the worst case, get fully blocked.
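One way to trade speed for a lower chance of being blocked is to put a random pause between requests. This is only a sketch: polite_delay and crawl are hypothetical helpers, and the 5 to 15 second range is an arbitrary assumption rather than a value known to satisfy Incapsula.

```python
import random
import time


def polite_delay(base=5.0, jitter=10.0, rng=random):
    """Return a wait in seconds: at least `base`, plus up to `jitter` extra."""
    return base + rng.uniform(0.0, jitter)


def crawl(urls, fetch, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing a random interval between requests."""
    results = []
    for index, url in enumerate(urls):
        if index:
            # No pause before the first request, a random one before each later one.
            sleep(polite_delay())
        results.append(fetch(url))
    return results


# Dry run: a fake fetch and a recording sleep, so nothing actually waits.
waits = []
pages = crawl(["/a", "/b", "/c"], fetch=lambda url: url.upper(), sleep=waits.append)
print(pages, len(waits))
```

In Scrapy itself, the DOWNLOAD_DELAY setting together with RANDOMIZE_DOWNLOAD_DELAY (which waits between 0.5 and 1.5 times the configured delay) gives a similar effect without custom code.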

There are multiple approaches, such as using a different IP per request (see: Make requests using Python over Tor), changing the user agent, etc. But most of them are bound to a set of defined timeouts and query patterns that you may need to discover.
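Rotating the user agent and the outgoing IP per request could look roughly like this. The request_profiles generator is a hypothetical helper, the user-agent strings are ordinary examples, and the proxy addresses are placeholders from a documentation-only IP range; none of this is known to defeat Incapsula on its own.

```python
import itertools
import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]
# Placeholder proxies; replace with proxies you actually control.
PROXIES = ["http://203.0.113.10:3128", "http://203.0.113.11:3128"]


def request_profiles(user_agents, proxies, rng=random):
    """Yield an endless stream of (headers, proxy) pairs."""
    proxy_cycle = itertools.cycle(proxies)
    while True:
        # Random user agent, round-robin proxy.
        yield {"User-Agent": rng.choice(user_agents)}, next(proxy_cycle)


profiles = request_profiles(USER_AGENTS, PROXIES)
headers, proxy = next(profiles)
print(headers["User-Agent"], proxy)
```

In a Scrapy spider, each pair could be passed along as Request(url, headers=headers, meta={'proxy': proxy}), since Scrapy's built-in proxy middleware reads the proxy key from the request meta.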

