
Python requests problem: Cloudflare error message "enable cookies"

I was planning to build a basic web scraper for Sneakersnstuff.com, but an error stopped me early on. When I request https://www.sneakersnstuff.com/, instead of getting the site's HTML or even the entrance captcha, I am redirected to a Cloudflare page with the error message "enable cookies". Both my code and the response are shown below.

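import requests

# A plain session sends the default python-requests User-Agent
session = requests.Session()

response = session.get('https://www.sneakersnstuff.com/')

# Print the response body (the Cloudflare error page shown below)
print(response.text)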
<!DOCTYPE html>
<!--[if lt IE 7]> <html class="no-js ie6 oldie" lang="en-US"> <![endif]-->
<!--[if IE 7]>    <html class="no-js ie7 oldie" lang="en-US"> <![endif]-->
<!--[if IE 8]>    <html class="no-js ie8 oldie" lang="en-US"> <![endif]-->
<!--[if gt IE 8]><!-->
<html class="no-js" lang="en-US">
<!--<![endif]-->

<head>
    <title>Access denied | www.sneakersnstuff.com used Cloudflare to restrict access</title>
    <meta charset="UTF-8" />
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
    <meta http-equiv="X-UA-Compatible" content="IE=Edge,chrome=1" />
    <meta name="robots" content="noindex, nofollow" />
    <meta name="viewport" content="width=device-width,initial-scale=1,maximum-scale=1" />
    <link rel="stylesheet" id="cf_styles-css" href="/cdn-cgi/styles/cf.errors.css" type="text/css"
        media="screen,projection" />
    <!--[if lt IE 9]><link rel="stylesheet" id='cf_styles-ie-css' href="/cdn-cgi/styles/cf.errors.ie.css" type="text/css" media="screen,projection" /><![endif]-->
    <style type="text/css">
        body {
            margin: 0;
            padding: 0
        }
    </style>


    <!--[if gte IE 10]><!-->
    <script type="text/javascript" src="/cdn-cgi/scripts/zepto.min.js"></script>
    <!--<![endif]-->
    <!--[if gte IE 10]><!-->
    <script type="text/javascript" src="/cdn-cgi/scripts/cf.common.js"></script>
    <!--<![endif]-->



</head>

<body>
    <div id="cf-wrapper">
        <div class="cf-alert cf-alert-error cf-cookie-error" id="cookie-alert" data-translate="enable_cookies">Please
            enable cookies.</div>
        <div id="cf-error-details" class="cf-error-details-wrapper">
            <div class="cf-wrapper cf-header cf-error-overview">
                <h1>
                    <span class="cf-error-type" data-translate="error">Error</span>
                    <span class="cf-error-code">1020</span>
                    <small class="heading-ray-id">Ray ID: 578133293d83e0d6 &bull; 2020-03-22 16:13:25 UTC</small>
                </h1>
                <h2 class="cf-subheadline">Access denied</h2>
            </div><!-- /.header -->

            <section></section><!-- spacer -->

            <div class="cf-section cf-wrapper">
                <div class="cf-columns two">
                    <div class="cf-column">
                        <h2 data-translate="what_happened">What happened?</h2>
                        <p>This website is using a security service to protect itself from online attacks.</p>

                    </div>



                </div>
            </div><!-- /.section -->

            <div class="cf-error-footer cf-wrapper">
                <p>
                    <span class="cf-footer-item">Cloudflare Ray ID: <strong>578133293d83e0d6</strong></span>
                    <span class="cf-footer-separator">&bull;</span>
                    <span class="cf-footer-item"><span>Your IP</span>: 96.241.108.243</span>
                    <span class="cf-footer-separator">&bull;</span>
                    <span class="cf-footer-item"><span>Performance &amp; security by</span> <a
                        href="https://www.cloudflare.com/5xx-error-landing?utm_source=error_footer" id="brand_link"
                        target="_blank">Cloudflare</a></span>

                </p>
            </div><!-- /.error-footer -->


        </div><!-- /#cf-error-details -->
    </div><!-- /#cf-wrapper -->

    <script type="text/javascript">
        window._cf_translation = {};


    </script>

</body>

</html>

I have tried cfscrape, a library recommended by many, to no avail.

import cloudscraper
from bs4 import BeautifulSoup

scraper = cloudscraper.create_scraper()

html = scraper.get("https://www.sneakersnstuff.com/").content

soup = BeautifulSoup(html, 'html.parser')

print(soup)

Output:

cloudscraper.exceptions.CloudflareReCaptchaProvider: Cloudflare reCaptcha detected, unfortunately you haven't loaded an anti reCaptcha provider correctly via the 'recaptcha' parameter.

Next Step ?

3rd Party reCaptcha Solvers Description

cloudscraper currently supports the following 3rd party reCaptcha solvers, should you require them.

anticaptcha
deathbycaptcha
2captcha
9kw
return_response

Adding Browser/User-Agent Filtering to cloudscraper did the trick for me.

import cloudscraper
from bs4 import BeautifulSoup

# Browser / User-Agent filtering should help. For example, this restricts
# cloudscraper to desktop Firefox User-Agents on Windows.
scraper = cloudscraper.create_scraper(browser={'browser': 'firefox', 'platform': 'windows', 'mobile': False})

html = scraper.get("https://www.sneakersnstuff.com/").content

soup = BeautifulSoup(html, 'html.parser')

print(soup)

Scraping Cloudflare-protected websites usually involves using clean proxies and emulating a real browser via solutions like Puppeteer.

I encountered the same issue when scraping an e-commerce website (guess dot com). Apparently, Cloudflare analyses the TLS fingerprint of the request and returns a 403 (error 1020) when the fingerprint matches Node.js, Python, or curl, which are commonly used for scraping. The solution is to emulate the fingerprint of a popular browser, and the most obvious way is to use Puppeteer with the puppeteer-extra stealth plugin. However, Puppeteer was not fast enough for my use case (to put it mildly: it is very heavy on resources and sluggish), so I had to build a utility that uses BoringSSL (the SSL library used by Chrome). Since compiling C/C++ code and deciphering the cryptic build errors of a TLS library is no fun for most web developers, I wrapped it as an API server, which you can try here: https://rapidapi.com/restyler/api/scrapeninja
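To get a feel for why a Python client can look different from a browser at the TLS layer, the sketch below lists the cipher suites that Python's standard `ssl` module enables on a default client context. The ordered cipher list is one of the inputs commonly used in TLS fingerprinting; which signals Cloudflare actually checks is not public, so this is only an illustration. The helper name `offered_cipher_names` is hypothetical.

```python
import ssl


def offered_cipher_names(context=None):
    """Return the names of the cipher suites enabled on a client TLS
    context, in priority order."""
    # Assumption: the standard-library default context; HTTP libraries
    # may configure their own cipher list on top of this.
    context = context or ssl.create_default_context()
    return [cipher["name"] for cipher in context.get_ciphers()]


if __name__ == "__main__":
    for name in offered_cipher_names():
        print(name)
```

A browser built on BoringSSL will generally offer a different list in a different order, which is the kind of difference a fingerprint can pick up.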

When using requests, I solved the issue by providing a supported user agent in the headers. The user agent I used before caused the problem; after changing it to a Mozilla one (see "Sending "User-agent" using Requests library in Python"), it works.
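A minimal sketch of this User-Agent fix, assuming a desktop Firefox string. The `FIREFOX_UA` value and the `browser_headers` and `fetch` helper names are illustrative and not from the original answer; whether a particular User-Agent is accepted depends on the site's firewall rules.

```python
# Example desktop Firefox User-Agent (an assumption; any current
# browser string could be substituted).
FIREFOX_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) "
    "Gecko/20100101 Firefox/115.0"
)


def browser_headers(user_agent=FIREFOX_UA):
    """Return request headers that carry a browser-like User-Agent."""
    return {"User-Agent": user_agent}


def fetch(url, user_agent=FIREFOX_UA):
    """GET a URL with the browser-like User-Agent set on the session."""
    # requests is a third-party package; it is imported here so that
    # browser_headers() stays usable without it.
    import requests

    session = requests.Session()
    session.headers.update(browser_headers(user_agent))
    return session.get(url, timeout=10)


if __name__ == "__main__":
    response = fetch("https://www.sneakersnstuff.com/")
    print(response.status_code)
```

Setting the header on the session rather than on a single call means every later request made through that session sends the same User-Agent.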

Unfortunately, the response messages were not really helpful in finding out what the issue was.
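Since the block page says little in plain words, it can help to pull out the two pieces of information it does carry: the Cloudflare error code and the Ray ID. The sketch below assumes the page markup matches the response posted in the question (a `cf-error-code` span and a "Ray ID:" label); the function name `describe_cloudflare_block` is hypothetical.

```python
import re


def describe_cloudflare_block(html):
    """Return the error code and Ray ID found in a Cloudflare error
    page, or None if the page does not look like one."""
    code = re.search(r'class="cf-error-code">\s*(\d+)\s*<', html)
    if code is None:
        return None
    # The Ray ID appears as plain text or wrapped in <strong>.
    ray = re.search(r'Ray ID:\s*(?:<strong>)?([0-9a-f]+)', html)
    return {
        "error_code": int(code.group(1)),
        "ray_id": ray.group(1) if ray else None,
    }


if __name__ == "__main__":
    sample = (
        '<span class="cf-error-code">1020</span>'
        '<small class="heading-ray-id">Ray ID: 578133293d83e0d6</small>'
    )
    print(describe_cloudflare_block(sample))
```

With requests, this would be called as `describe_cloudflare_block(response.text)`. Code 1020 is the "Access denied" case shown in the question, which indicates the request was rejected by a firewall rule rather than by a missing cookie.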
