I use Scrapy and am trying to scrape a site that is protected by Incapsula:
<meta name="robots" content="noindex,nofollow">
<script src="/_Incapsula_Resource?SWJIYLWA=719d34d31c8e3a6e6fffd425f7e032f3">
</script>
I asked a question about this issue two years ago, but that method (Incapsula-Cracker) no longer works.
I tried to understand how Incapsula works and wrote the following to bypass it:
import re
from scrapy import Request

# The functions below are methods of the spider class.
def start_requests(self):
    yield Request('https://courses-en-ligne.carrefour.fr', cookies={'store': 92},
                  dont_filter=True, callback=self.init_shop)

def init_shop(self, response):
    result_content = response.text
    RE_ENCODED_FUNCTION = re.compile(r'var b="(.*?)"', re.DOTALL)
    RE_INCAPSULA = re.compile(r'(_Incapsula_Resource\?SWHANEDL=.*?)"')
    INCAPSULA_URL = 'https://courses-en-ligne.carrefour.fr/%s'
    # The challenge script is embedded as a hex string; decode it two characters at a time.
    encoded_func = RE_ENCODED_FUNCTION.search(result_content).group(1)
    decoded_func = ''.join(chr(int(encoded_func[i:i+2], 16))
                           for i in range(0, len(encoded_func), 2))
    # Extract the resource path from the decoded script and request it.
    incapsula_params = RE_INCAPSULA.search(decoded_func).group(1)
    incap_url = INCAPSULA_URL % incapsula_params
    yield Request(incap_url)

def parse(self, response):
    print(response.body)
But I am redirected to a reCAPTCHA page:
<html style="height:100%">
<head>
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
<meta name="format-detection" content="telephone=no">
<meta name="viewport" content="initial-scale=1.0">
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
</head>
<body style="margin:0px;height:100%">
<iframe src="/_Incapsula_Resource?CWUDNSAI=27&xinfo=3-10784678-0%200NNN%20RT%281523525225370%20394%29%20q%280%20-1%20-1%20-1%29%20r%280%20-1%29%20B12%284%2c316%2c0%29%20U10000&incident_id=459000960022408474-41333502566401539&edet=12&cinfo=04000000" frameborder=0 width="100%" height="100%" marginheight="0px" marginwidth="0px">Request unsuccessful. Incapsula incident ID: 459000960022408474-41333502566401539
</iframe>
</body>
</html>
First of all, there is no foolproof solution to this kind of problem. Even as a real user, I end up having to solve captchas while answering on Stack Overflow, which means a bot will definitely get them.
There are a few rules I try to follow to decrease the chances of getting a captcha:
- TOR is a big NO.
- Chrome + Selenium + Proxy + existing profile. I prefer profiles that have browsing history with different websites, plus cookies from many other sites and trackers, going back months. You don't know how the user/bot evaluation is done, so you want to look as much like a real user as possible.
This is a cat and mouse game where you don't know what the other party has as a defense, so you try to play nice and easy.
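The Chrome + Selenium + proxy + existing profile setup could be sketched roughly as follows, assuming Selenium 4 and a local Chrome install. The profile path, profile name, and proxy address are placeholders rather than values from the question, and `build_chrome_args` and `make_driver` are hypothetical helper names.

```python
def build_chrome_args(user_data_dir, profile_dir="Default", proxy=None):
    """Return Chrome command-line switches for an existing profile and optional proxy."""
    args = [
        "--user-data-dir=" + user_data_dir,    # folder holding the existing Chrome profiles
        "--profile-directory=" + profile_dir,  # which profile inside that folder to load
    ]
    if proxy:
        args.append("--proxy-server=" + proxy)  # e.g. "http://127.0.0.1:8080"
    return args


def make_driver(user_data_dir, profile_dir="Default", proxy=None):
    """Start Chrome through Selenium with the switches above (needs selenium and Chrome)."""
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for arg in build_chrome_args(user_data_dir, profile_dir, proxy):
        options.add_argument(arg)
    return webdriver.Chrome(options=options)


if __name__ == "__main__":
    # Placeholder values; nothing is launched here, the switches are only printed.
    print(build_chrome_args("/path/to/chrome-profile", proxy="http://127.0.0.1:8080"))
```

Chrome generally refuses to share a profile directory between two running instances, so it is probably safer to point `user_data_dir` at a copy of the profile than at the one your everyday browser has open.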
This is not the best answer, but it gives some points to help understand why web scraping is not that easy, especially when a CDN sits in front of the site.
First, it may be good to check what you will be fighting against: WAF & Bot Mitigation.
Then, for more ideas, this is a good talk: How Attackers Circumvent CDNs to Attack Origin
This does not mean that web scraping is impossible; the problem now comes down to time and speed. The faster you send requests, the higher the chances that you trigger the captchas and, in the worst case, get fully blocked.
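Since the question uses Scrapy, slowing down can be sketched with its built-in throttling settings. The setting names are Scrapy's own, but every value below is an assumption to be tuned against the target site, and `THROTTLE_SETTINGS` is just an illustrative name.

```python
# Hypothetical throttling values; none of them come from the question.
THROTTLE_SETTINGS = {
    "DOWNLOAD_DELAY": 5,                  # base delay in seconds between requests to the same site
    "RANDOMIZE_DOWNLOAD_DELAY": True,     # wait 0.5x to 1.5x DOWNLOAD_DELAY instead of a fixed interval
    "CONCURRENT_REQUESTS_PER_DOMAIN": 1,  # one request at a time per domain
    "AUTOTHROTTLE_ENABLED": True,         # let Scrapy adapt the delay to observed latency
    "AUTOTHROTTLE_START_DELAY": 5,        # initial delay used by AutoThrottle
    "AUTOTHROTTLE_MAX_DELAY": 60,         # upper bound on the delay under high latency
}
```

One way to apply this is to assign the dictionary to the spider's `custom_settings` class attribute, which keeps the slow pace local to this one spider.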
There are multiple approaches, such as using a different IP per request (see Make requests using Python over Tor), changing the user agent, and so on. But most of them are bound to a set of defined timeouts and query patterns that you may need to find.
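A minimal sketch of the "different IP and user agent per request" idea as a Scrapy downloader middleware. The class name `RotateIdentityMiddleware`, the user agent strings, and the proxy addresses are all placeholders introduced for illustration, and nothing here is known to get past Incapsula.

```python
import random

# Placeholder pools; neither the user agents nor the proxies come from the question.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]
PROXIES = ["http://127.0.0.1:8080", "http://127.0.0.1:8081"]


class RotateIdentityMiddleware:
    """Downloader middleware sketch: pick a user agent and a proxy for each request."""

    def __init__(self, user_agents=USER_AGENTS, proxies=PROXIES, rng=None):
        self.user_agents = list(user_agents)
        self.proxies = list(proxies)
        self.rng = rng or random.Random()

    def process_request(self, request, spider):
        request.headers["User-Agent"] = self.rng.choice(self.user_agents)
        # Scrapy's built-in HttpProxyMiddleware reads the proxy from request.meta["proxy"].
        request.meta["proxy"] = self.rng.choice(self.proxies)
        return None  # let Scrapy continue processing the request
```

To try it, the class would be registered in `DOWNLOADER_MIDDLEWARES` under whatever module path your project uses, with a priority number lower than that of the built-in `HttpProxyMiddleware` so that the proxy is set before that middleware runs.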