
Amazon Scraping - Scraping works sometimes

I'm scraping data from Amazon for educational purposes, and I'm having some problems with cookies and the antibot. I manage to scrape data, but sometimes the cookies are missing from the response, or the antibot flags me.

I already tried picking headers at random from a list like this:

headers_list = [{
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "DNT": "1",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
    "Sec-Fetch-User": "?1",
    "TE": "trailers"
},
    {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8",
    "Accept-Language": "fr-FR,fr;q=0.7",
    "cache-control": "max-age=0",
    "content-type": "application/x-www-form-urlencoded",
    "sec-fetch-dest": "document",
    "sec-fetch-mode": "navigate",
    "sec-fetch-site": "same-origin",
    "sec-fetch-user": "?1",
    "upgrade-insecure-requests": "1"
    },
]

And put the following in my code:

    headers = random.choice(headers_list)
    with requests.Session() as s:
        res = s.get(url, headers=headers)
        if not res.cookies:
            print("Error getting cookies")
            raise SystemExit(1)

But this doesn't solve the issue; I still sometimes get no cookie in the response, and the antibot still detects me.

I am scraping the data like this:

        post = s.post(url, data=login_data, headers=headers, cookies=cookies, allow_redirects=True)
        soup = BeautifulSoup(post.text, 'html.parser')
        if soup.find('input', {'name': 'appActionToken'})['value'] is not None \
                and soup.find('input', {'name': 'appAction'})['value'] is not None \
                and soup.find('input', {'name': 'subPageType'})['value'] is not None \
                and soup.find('input', {'name': 'openid.return_to'})['value'] is not None \
                and soup.find('input', {'name': 'prevRID'})['value'] is not None \
                and soup.find('input', {'name': 'workflowState'})['value'] is not None \
                and soup.find('input', {'name': 'email'})['value'] is not None:
            print("found")
        else:
            print("not found")
            raise SystemExit(1)

But when the antibot detects me, this content is not available, so the code throws an error. Any idea how I could prevent that? Thanks!

You can call time.sleep(10) (or wait some other amount of time) before each scrape operation. That makes it harder for Amazon to catch you, but if you send too many requests at regular intervals, they may detect and block those as well.
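
For example, a minimal sketch of that idea (the urls list and headers dict here are placeholders, not taken from the original code):

    import random
    import time

    import requests

    # Placeholder values; substitute your own product URLs and request headers.
    urls = ["https://www.amazon.com/dp/B097PVKRYM/"]
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"}

    with requests.Session() as s:
        for url in urls:
            # Wait about 10 seconds plus some random jitter before each request,
            # so the timing does not look perfectly regular.
            time.sleep(10 + random.uniform(0, 5))
            res = s.get(url, headers=headers)
            print(res.status_code)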

  • Rotate your request headers with random user agents (add more user agents to your headers list)

  • Remove everything (tracking parameters) that comes after /dp/ASIN/ in the product URL

    For example, after removing the tracking parameters, your URL will look like this: https://www.amazon.com/Storage-Stackable-Organizer-Foldable-Containers/dp/B097PVKRYM/

  • Add a little sleep between requests (use time.sleep())

  • Use a proxy with your requests (you can use a Tor proxy; if they block Tor, go with some other paid proxy service). A rough sketch combining these points is shown below.
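
Here is a rough sketch that puts these points together. The regex, the proxy address, and the example URL are illustrative assumptions; headers_list stands in for the fuller list from the question (shortened to one entry here), and the Tor proxy requires requests[socks] to be installed:

    import random
    import re
    import time

    import requests

    # Shortened stand-in for the headers_list defined in the question.
    headers_list = [
        {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0"},
    ]

    # Placeholder Tor SOCKS proxy (default port 9050); needs `pip install requests[socks]`.
    proxies = {
        "http": "socks5h://127.0.0.1:9050",
        "https": "socks5h://127.0.0.1:9050",
    }

    def clean_product_url(url):
        """Keep only the part up to /dp/<ASIN>/ and drop the tracking parameters."""
        m = re.search(r"^(.*?/dp/[A-Z0-9]{10})", url)
        return m.group(1) + "/" if m else url

    with requests.Session() as s:
        url = clean_product_url(
            "https://www.amazon.com/Storage-Stackable-Organizer-Foldable-Containers"
            "/dp/B097PVKRYM/ref=sr_1_1?keywords=storage&qid=1234567890"
        )
        time.sleep(random.uniform(5, 10))              # randomized delay between requests
        res = s.get(
            url,
            headers=random.choice(headers_list),       # rotate request headers
            proxies=proxies,                           # route the request through the proxy
            timeout=30,
        )
        print(res.status_code, url)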
