
How Incapsula works and how to beat it

Incapsula is a web application delivery platform that can be used to prevent scraping.

I am working with Python and Scrapy, and I found this, but it seems to be out of date and no longer works with current Incapsula. I tested the Scrapy middleware against my target website and got IndexErrors because the middleware was unable to extract some obfuscated parameter.

Is it possible to adapt this repo, or has Incapsula changed how it operates?

I'm also curious why, when I "copy as cURL" the request to my target page from Chrome dev tools, the Chrome response contains the user content, yet the curl response is an "Incapsula incident" page. This is with Chrome's cookies initially cleared:

curl 'https://www.radarcupon.es/tienda/fotoprix.com' \
  -H 'pragma: no-cache' -H 'dnt: 1' -H 'accept-encoding: gzip, deflate, br' \
  -H 'accept-language: en-GB,en-US;q=0.9,en;q=0.8' \
  -H 'upgrade-insecure-requests: 1' \
  -H 'user-agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/62.0.3202.94 Chrome/62.0.3202.94 Safari/537.36' \
  -H 'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8' \
  -H 'cache-control: no-cache' -H 'authority: www.radarcupon.es' \
  --compressed

I was expecting the first request from both to return something like a JavaScript challenge, which would set a cookie, but it doesn't seem to work like that any more?

It's difficult to give a specific answer, because Incapsula has a very detailed rules engine that can be used to block or challenge requests. Cookie detection and JavaScript support are the two most common data points used to identify suspicious traffic; user-agent strings, headers, and behavior originating from the client IP address (requests per minute, AJAX requests, etc.) can also cause Incapsula to challenge traffic. The DDoS protection feature blocks requests aggressively if it is not configured sensibly relative to the amount of traffic a site sees.

There could be multiple reasons, and it's hard to pinpoint exactly which combination of rules Incapsula is applying to detect you as a bot. It could be using IP rate limiting, browser fingerprinting, header validation, TCP/IP fingerprinting, user-agent checks, etc.

But you can try:

  • Rotating IPs.

    You can easily find lists of free proxies on the internet, and you can use a solution like the scrapy-rotating-proxies middleware to configure multiple proxies in your spider and have requests rotate through them automatically (see the settings sketch after this list).

  • Rotating USER_AGENT.

    One way past this filter is to set your USER_AGENT to a value copied from a popular web browser. In some rare cases, you may need a user-agent string from a specific web browser. There are multiple Scrapy plugins that can rotate your requests through popular browser user-agent strings, such as scrapy-random-useragent or Scrapy-UserAgents (a minimal hand-rolled version is sketched after this list).

  • You can try inspecting the requests in your browser's developer tools and reverse-engineering the request parameters.
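For illustration, here is a minimal Scrapy sketch combining the first two ideas. The ROTATING_PROXY_LIST and middleware entries follow scrapy-rotating-proxies' documented usage; the user-agent rotation is a small hand-rolled middleware (simpler than, but similar in spirit to, the plugins named above), and the proxy addresses, user-agent strings, and "myproject" module path are placeholders:

    # settings.py -- a minimal sketch; proxy addresses are placeholders
    # and "myproject" is a hypothetical project name.
    ROTATING_PROXY_LIST = [
        "proxy1.example.com:8000",
        "proxy2.example.com:8031",
    ]
    DOWNLOADER_MIDDLEWARES = {
        # scrapy-rotating-proxies middlewares, per that plugin's README:
        "rotating_proxies.middlewares.RotatingProxyMiddleware": 610,
        "rotating_proxies.middlewares.BanDetectionMiddleware": 620,
        # Disable Scrapy's built-in user-agent middleware, use our own:
        "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
        "myproject.middlewares.RandomUserAgentMiddleware": 400,
    }

    # middlewares.py -- hand-rolled user-agent rotation.
    import random

    USER_AGENTS = [
        # Values copied from real browsers; extend this list as needed.
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0",
    ]

    class RandomUserAgentMiddleware:
        def process_request(self, request, spider):
            # Pick a fresh user agent for each outgoing request.
            request.headers["User-Agent"] = random.choice(USER_AGENTS)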

Mostly in such scenarios, the objective is to avoid getting banned by crawling with best practices in mind; you can read about them here. Or you can try dedicated tools for this, like Smart Proxy Manager or Smart Browser. I work as a Developer Advocate @Zyte.

Incapsula, like many other anti-scraping services, uses three types of details to identify web scrapers:

  1. IP address meta information
  2. JavaScript fingerprinting
  3. Request analysis

To get around this protection, we need to ensure that these details match that of a common web user.

IP Addresses

A natural web user usually connects from a residential or mobile IP address, whereas many production scrapers are deployed on datacenter IP addresses (Google Cloud, AWS, etc.). These three types are very different and can be told apart by analyzing IP databases: datacenter addresses belong to commercial hosting providers, residential addresses to household connections, and mobile addresses to cell-tower-based mobile networks (3G, 4G, etc.).

So, we want to distribute our scraper traffic through a pool of residential or mobile proxies.

JavaScript Fingerprinting

Using JavaScript, these services can analyze the browser environment and build a fingerprint. If we are running browser-automation tools (like Selenium, Playwright, or Puppeteer) as web scrapers, we need to ensure that the browser environment appears user-like.

This is a huge subject, but a good start is to take a look at the puppeteer-stealth plugin, which applies patches to the browser environment to hide various details that reveal that the browser is being controlled by a script.

Note: puppeteer-stealth is incomplete, and you need to do extra work to get past Incapsula reliably.
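Since the question is about Python, here is a minimal sketch of the same idea using Playwright's sync API with the playwright-stealth package (a Python port of puppeteer-stealth's patches); stealth_sync is that package's entry point as I understand it, so treat this as an assumption-laden starting point, not a guaranteed bypass:

    # Sketch: browser automation with stealth patches applied.
    # pip install playwright playwright-stealth && playwright install chromium
    from playwright.sync_api import sync_playwright
    from playwright_stealth import stealth_sync

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        stealth_sync(page)  # patch navigator.webdriver, plugins, languages, etc.
        page.goto("https://www.radarcupon.es/tienda/fotoprix.com")
        print(page.content()[:500])  # check: real content or a challenge page?
        browser.close()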

An SO answer is a bit short to cover this fully, but I wrote an extensive introduction on the subject on my blog: How to Avoid Web Scraping Blocking: Javascript

Request Analysis

Finally, the way our scraper connects plays a huge role as well. Connection patterns can be used to determine whether the client is a real user or a bot: for example, real users usually navigate a website in more chaotic patterns, like visiting the home page first, then category pages, and so on.

A stealthy scraper should introduce a bit of chaos into its connection patterns (see the settings sketch below).
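In Scrapy terms, here is a hedged sketch of what "a bit of chaos" can mean; these are all standard Scrapy settings, and the values are illustrative:

    # settings.py -- make request timing less robotic (values illustrative).
    DOWNLOAD_DELAY = 2.0                 # base delay between requests
    RANDOMIZE_DOWNLOAD_DELAY = True      # actual delay varies 0.5x-1.5x of the base
    AUTOTHROTTLE_ENABLED = True          # adapt crawl rate to server latency
    AUTOTHROTTLE_START_DELAY = 1.0
    AUTOTHROTTLE_MAX_DELAY = 10.0
    CONCURRENT_REQUESTS_PER_DOMAIN = 1   # avoid bursts against a single site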

Curl is not going to cut it

Since you're asking about curl: because Incapsula relies on JS fingerprinting, you won't have much luck in this scenario. However, there are a few things to note that might help with other systems:

  • The HTTP/2 (or HTTP/3) protocol will have a much higher success rate. Curl and many other HTTP clients default to HTTP/1.1, while the majority of real user traffic runs HTTP/2 or newer - it's a dead giveaway (see the httpx sketch below).
  • Header values and ordering matter too, as real browsers (Chrome, Firefox, etc.) send headers in a specific order with specific values. If your scraper's connection differs - it's a dead giveaway.
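As a Python illustration, httpx can negotiate HTTP/2, which already matches browser traffic more closely than curl's default HTTP/1.1. Note that most Python clients give only limited control over header order, so this is a partial fix, and the header values here are just examples copied from a browser:

    # Sketch: an HTTP/2 request with browser-like headers.
    # pip install 'httpx[http2]'
    import httpx

    headers = {
        "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "accept-language": "en-GB,en-US;q=0.9,en;q=0.8",
    }

    with httpx.Client(http2=True, headers=headers) as client:
        resp = client.get("https://www.radarcupon.es/tienda/fotoprix.com")
        print(resp.http_version, resp.status_code)  # e.g. "HTTP/2 403" if still blocked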

Understanding these three details that differentiate bot traffic from real human traffic can help us develop more stealthy scrapers. I wrote more on this subject on my blog, if you'd like to learn more: How to Scrape Without Getting Blocked
