
Puppeteer: bypass Cloudflare by enabling cookies and JavaScript

(In Node.js, server side only.)

I'm doing some web scraping, and some pages are protected by the Cloudflare anti-DDoS page. I'm trying to bypass this page. Searching around, I found a lot of articles on the stealth method or reCAPTCHA. But Cloudflare is not even giving me a CAPTCHA: it stays stuck on the "wait 5 seconds" page, which displays in red (TURN ON JAVASCRIPT AND RELOAD) and (TURN ON COOKIES AND RELOAD). My JavaScript seems to be active, though, because my program runs on a lot of websites and processes their JavaScript.

This is my code:

//vm = this;
vm.puppeteer.use(vm.StealthPlugin())
vm.puppeteer.use(vm.AdblockerPlugin({
  blockTrackers: true
}))
let browser = await vm.puppeteer.launch({
  headless: true
});
let browserPage = await browser.newPage();
await browserPage.goto(link, {
  waitUntil: 'networkidle2',
  timeout: 40 * 1000
});
await browserPage.waitForTimeout(20 * 1000);
let body = await browserPage.evaluate(() => {
  return document.documentElement.outerHTML;
});

I also tried removing StealthPlugin and AdblockerPlugin, but Cloudflare keeps telling me that JavaScript and cookies are turned off.

Can anyone help me please?

Setting your own User-Agent and Accept-Language headers should work, because your headless browser needs to look like a real person browsing.

You can use page.setExtraHTTPHeaders() and page.setUserAgent() to do so.

await browserPage.setExtraHTTPHeaders({
 'Accept-Language': 'en'
});
// You can use any UserAgent you want
await browserPage.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36');

You're probably just failing the challenges provided by Cloudflare. Web application firewall providers such as Cloudflare use JavaScript to collect data from your browser, then encode it and send it to the server for further analysis.
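To get a feel for what such a script could observe, it can help to read a few browser properties yourself. The snippet below is only a sketch under stated assumptions: `flagSignals` is a hypothetical helper, and the three properties it inspects (`navigator.webdriver`, `navigator.cookieEnabled`, and a `HeadlessChrome` token in the user agent) are just well-known automation giveaways, not Cloudflare's actual checks.

```javascript
// Hypothetical helper: takes a plain object of values read from the page
// and returns human-readable warnings. The checks are illustrative only;
// they are NOT the checks Cloudflare performs.
function flagSignals(signals) {
  const warnings = [];
  if (signals.webdriver === true) {
    warnings.push('navigator.webdriver is true');
  }
  if (signals.cookieEnabled === false) {
    warnings.push('cookies are disabled');
  }
  if (/HeadlessChrome/.test(signals.userAgent || '')) {
    warnings.push('user agent contains HeadlessChrome');
  }
  return warnings;
}

// Usage with Puppeteer (inside an async function, page already opened):
// const signals = await browserPage.evaluate(() => ({
//   webdriver: navigator.webdriver,
//   cookieEnabled: navigator.cookieEnabled,
//   userAgent: navigator.userAgent
// }));
// console.log(flagSignals(signals));
```

An empty array does not mean the challenge will pass; it only means these particular giveaways are absent.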

When a request is sent to the host, Cloudflare checks the request's headers along with the TLS and HTTP/2 fields to determine whether the client is a bot. While the TLS and HTTP/2 fingerprint is static for a given request client, the JavaScript challenge still sends the client's behavior to the server as an encoded payload.

These payloads are compared to the pre-collected fingerprint databases. If there's an inconsistency, you fail. When you fail these challenges, you will be redirected to the page you see.
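Because a failed challenge still returns HTML, it is easy to scrape the interstitial by mistake. A minimal sketch for spotting that case follows. `looksLikeChallengePage` and `CHALLENGE_MARKERS` are hypothetical names, and the marker strings are copied from the question's description, so they are an assumption and may not match the exact wording of the page you receive.

```javascript
// Marker strings taken from the interstitial as described in the question.
// Assumption: the real page may word these differently; adjust as needed.
const CHALLENGE_MARKERS = [
  'turn on javascript and reload',
  'turn on cookies and reload'
];

// Hypothetical helper: returns true if the HTML contains any marker
// (case-insensitive substring match).
function looksLikeChallengePage(html) {
  const text = String(html).toLowerCase();
  return CHALLENGE_MARKERS.some(marker => text.includes(marker));
}

// Usage with Puppeteer (inside an async function, after goto()):
// const body = await browserPage.content();
// if (looksLikeChallengePage(body)) {
//   console.warn('Still on the challenge page; do not parse this HTML');
// }
```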

You're already using puppeteer-extra and puppeteer-extra-plugin-stealth, which are usually the first steps. Those solve some of the problems mentioned above. If the problem is with your IP (or geolocation), you can check out the proxy-chain package. Also, you'll want to change your User-Agent and other request headers to imitate a real user as much as possible.

Since you haven't mentioned a website, here's a minimal working example using puppeteer-extra-plugin-stealth, proxy-chain (with an HTTP proxy), and rand-user-agent:

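const puppeteer = require('puppeteer-extra');
const randUserAgent = require('rand-user-agent');
const proxyChain = require('proxy-chain');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');

puppeteer.use(StealthPlugin());

const agent = randUserAgent('desktop');
const proxyUrl = 'http://userid:pw@ip:port'; // placeholder: fill in your own proxy

(async () => {
  // anonymizeProxy() returns a promise, so await it before building the flag;
  // concatenating the un-awaited function would pass its source text to Chrome
  const proxy = await proxyChain.anonymizeProxy(proxyUrl);

  const browser = await puppeteer.launch({
    headless: true,
    args: ['--proxy-server=' + proxy]
  });
  const page = await browser.newPage();

  await page.setUserAgent(agent);
  await page.goto('https://bot.sannysoft.com', { waitUntil: 'networkidle2' });
  // Plain timer instead of page.waitForTimeout(), which newer Puppeteer releases removed
  await new Promise(resolve => setTimeout(resolve, 5000));
  await page.screenshot({ path: 'test.png', fullPage: true });

  await browser.close();
  // Shut down the local proxy server started by anonymizeProxy()
  await proxyChain.closeAnonymizedProxy(proxy, true);
})();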

Of course, you can still fail the challenges, since it's really difficult to imitate a real user. For more information about Cloudflare's methods: bypass Cloudflare.


 