Amazon scraping - scraping only works sometimes

Question

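I am scraping data from Amazon for educational purposes and am running into problems with cookies and the antibot. I manage to scrape the data, but sometimes the cookies are missing from the response, or the antibot flags me.

I have already tried using a list of randomized headers like this: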

headers_list = [
    {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
        "DNT": "1",
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1",
        "Sec-Fetch-User": "?1",
        "TE": "trailers"
    },
    {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8",
        "Accept-Language": "fr-FR,fr;q=0.7",
        "cache-control": "max-age=0",
        "content-type": "application/x-www-form-urlencoded",
        "sec-fetch-dest": "document",
        "sec-fetch-mode": "navigate",
        "sec-fetch-site": "same-origin",
        "sec-fetch-user": "?1",
        "upgrade-insecure-requests": "1"
    },
]

and put the following in my code:

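    import random
    import requests

    # Pick one header set at random for this session
    headers = random.choice(headers_list)
    with requests.Session() as s:
        res = s.get(url, headers=headers)
        # An empty cookie jar means the response set no cookies
        if not res.cookies:
            print("Error getting cookies")
            raise SystemExit(1)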

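But this does not solve the problem: sometimes I still receive no cookies in the response and get detected by the antibot.

I am scraping the data like this: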

        post = s.post(url, data=login_data, headers=headers, cookies=cookies, allow_redirects=True)
        soup = BeautifulSoup(post.text, 'html.parser')
        if soup.find('input', {'name': 'appActionToken'})['value'] is not None \
                and soup.find('input', {'name': 'appAction'})['value'] is not None \
                and soup.find('input', {'name': 'subPageType'})['value'] is not None \
                and soup.find('input', {'name': 'openid.return_to'})['value'] is not None \
                and soup.find('input', {'name': 'prevRID'})['value'] is not None \
                and soup.find('input', {'name': 'workflowState'})['value'] is not None \
                and soup.find('input', {'name': 'email'})['value'] is not None:
            print("found")
        else:
            print("not found")
            raise SystemExit(1)

But when the antibot detects me, this content is not available, so an error is thrown. Any ideas on how to prevent this? Thanks!
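
The error described here most likely comes from subscripting the result of `soup.find(...)`: when the input is absent, `find` returns `None`, and `None['value']` raises a `TypeError` before the `else` branch is ever reached. A minimal sketch of a check that reports missing fields instead of raising is shown below. `extract_form_fields` and `REQUIRED_FIELDS` are hypothetical names introduced for illustration; `soup` is assumed to be the `BeautifulSoup` object from the snippet above.

```python
# Field names taken from the check in the question.
REQUIRED_FIELDS = (
    "appActionToken", "appAction", "subPageType",
    "openid.return_to", "prevRID", "workflowState", "email",
)


def extract_form_fields(soup, names=REQUIRED_FIELDS):
    """Return (values, missing) for the named <input> fields.

    `values` maps field name to its value attribute; `missing` lists the
    fields that were absent or had no value attribute.
    """
    values, missing = {}, []
    for name in names:
        tag = soup.find("input", {"name": name})
        # find() returns None when the tag is absent, and tag.get()
        # returns None when the tag has no "value" attribute.
        value = tag.get("value") if tag is not None else None
        if value is None:
            missing.append(name)
        else:
            values[name] = value
    return values, missing
```

With this, `values, missing = extract_form_fields(soup)` can be followed by `if missing:` to log which fields were absent and retry or exit, rather than crashing on the first missing one.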

Answer 1

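You can add a time.sleep(10) before each scrape operation. That makes it harder for Amazon to catch you, but if you send too many requests at regular intervals, they may still detect and block them.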
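
Since perfectly regular intervals can themselves stand out, one possible variation is to add random jitter to the pause. The sketch below is only an illustration: `jittered_delay` and `sleep_between_requests` are hypothetical helper names, the 10-second base comes from the suggestion above, and the 5-second jitter is an arbitrary assumption. It does not guarantee that requests will go undetected.

```python
import random
import time


def jittered_delay(base=10.0, jitter=5.0):
    """Return a delay in seconds drawn uniformly from [base, base + jitter]."""
    return base + random.uniform(0.0, jitter)


def sleep_between_requests(base=10.0, jitter=5.0, sleep=time.sleep):
    """Pause for a jittered delay and return the number of seconds slept.

    The `sleep` parameter exists only so the pause can be replaced in tests.
    """
    delay = jittered_delay(base, jitter)
    sleep(delay)
    return delay
```

Calling `sleep_between_requests()` before each `s.get(...)` or `s.post(...)` replaces the fixed `time.sleep(10)`.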

Answer 2

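Rotate your request headers with random user agents (extend your headers list with more user agents).
Remove everything after /dp/ASIN/ from the product URL (the tracking parameters).
For example, after removing the tracking parameters, your URL would look like this: https://www.amazon.com/Storage-Stackable-Organizer-Foldable-Containers/dp/B097PVKRYM/
Add a short sleep between requests (using time.sleep()).
Use proxies for your requests (you can use Tor proxies; if they block Tor, go with a paid proxy service).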
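
The URL-cleaning step above can be sketched as follows. This is a minimal illustration using only the standard library; `strip_tracking` is a hypothetical helper name, and the pattern assumes the ASIN is 10 uppercase letters or digits directly after `/dp/`. URLs that do not match are returned unchanged.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Assumption: an ASIN is 10 uppercase alphanumeric characters after /dp/.
_DP_PATTERN = re.compile(r"^(.*?/dp/[A-Z0-9]{10})(?:/|$)")


def strip_tracking(url):
    """Drop everything after /dp/ASIN/ (path suffix, query, fragment)."""
    parts = urlsplit(url)
    match = _DP_PATTERN.match(parts.path)
    if match is None:
        # Not a /dp/ASIN product URL; leave it untouched.
        return url
    clean_path = match.group(1) + "/"
    # Empty strings discard the query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, clean_path, "", ""))
```

For instance, given the example URL above with a made-up suffix such as `ref=sr_1_1?keywords=box` appended, `strip_tracking` returns the URL ending in `/dp/B097PVKRYM/`.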

Answer 1: 0 2023-01-15 15:52:49

Answer 2: 0 2023-01-15 18:10:48