Amazon blocks Python 3 scraping with bs4 and requests

Question

This code worked fine when I ran it a few days ago:

from bs4 import BeautifulSoup
import datetime
import requests

def getWeekMostRead(date):
    nonfiction_page = requests.get("https://www.amazon.com/charts/"+date.isoformat()+"/mostread/nonfiction")
    content = "amazon"+date.isoformat()+"_nonfiction.html"
    with open(content, "w", encoding="utf-8") as nf_file:
        print(nonfiction_page.content, file=nf_file)

    mostRead_nonfiction = BeautifulSoup(nonfiction_page.content, features="html.parser")

    nonfiction = mostRead_nonfiction.find_all("div", class_="kc-horizontal-rank-card")

    mostread = []
    for books in nonfiction:
        if books.find(class_="kc-rank-card-publisher") is None:
            mostread.append((
                books.find(class_="kc-rank-card-title").string.strip(),
                books.find(class_="kc-rank-card-author").string.strip(),
                "",
                books.find(class_="numeric-star-data").small.string.strip()
            ))
        else:
            mostread.append((
                books.find(class_="kc-rank-card-title").string.strip(),
                books.find(class_="kc-rank-card-author").string.strip(),
                books.find(class_="kc-rank-card-publisher").string.strip(),
                books.find(class_="numeric-star-data").small.string.strip()
            ))
    return mostread

mostread = []
date = datetime.date(2020,1,1)
while date >= datetime.date(2015,1,1):
    print("Scraped data from "+date.isoformat())
    mostread.extend(getWeekMostRead(date))
    date -= datetime.timedelta(7)
print("Currently saving scraped data to AmazonCharts.csv")
# The with-block closes the file on exit, so no explicit close() is needed.
with open("AmazonCharts.csv", "w", encoding="utf-8") as csv_file:
    print("ID,Title,Author,Publisher,Rating", file=csv_file)
    for counter, book in enumerate(mostread, start=1):
        print('AmazonCharts'+str(counter)+',"'+book[0]+'","'+book[1]+'","'+book[2]+'","'+book[3]+'"', file=csv_file)

For some reason, when I tried running it again today, the returned HTML file contained this:

To discuss automated access to Amazon data please contact api-services-support@amazon.com.\r\n\r\nFor information about migrating to our APIs refer to our Marketplace APIs at https://developer.amazonservices.com/ref=rm_5_sv, or our Product Advertising API at https://affiliate-program.amazon.com/gp/advertising/api/detail/main.html/ref=rm_5_ac for advertising use cases.

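I know Amazon is heavily anti-scraping (or at least that is what I gathered from some replies and threads). I tried adding headers and delays to the code, but it did not help. Is there another approach I could try? Or, if I should just wait, how long should I wait?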

Answer 1

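As you point out, Amazon is strongly anti-scraping. An entire industry is built around scraping data from Amazon, and Amazon sells access to its own APIs, so it is in their best interest to stop people from freely scraping their pages.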

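Based on your code, I suspect you made too many requests too quickly and got IP-banned. When scraping a site, it is usually best to scrape responsibly: do not go too fast, rotate user agents, and rotate IPs through a proxy service.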

To look less programmatic, you can also randomize the timing of your requests so the traffic appears more human.
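The pacing and user-agent advice above could be sketched roughly as follows. This is a minimal illustration, not a guaranteed way around Amazon's blocking: the `USER_AGENTS` pool, the 2 to 6 second delay range, and the helper names `random_delay`, `pick_headers`, and `polite_get` are all assumptions made up for this example.

```python
import random
import time

# Hypothetical pool of User-Agent strings; replace with values you have chosen yourself.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0",
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/2.0",
]


def random_delay(min_seconds=2.0, max_seconds=6.0, rng=random):
    """Return a delay drawn uniformly from [min_seconds, max_seconds]."""
    return rng.uniform(min_seconds, max_seconds)


def pick_headers(user_agents=USER_AGENTS, rng=random):
    """Build request headers with a randomly chosen User-Agent."""
    return {"User-Agent": rng.choice(user_agents)}


def polite_get(session, url, sleep=time.sleep, rng=random):
    """Wait a random interval, then fetch url with a rotated User-Agent.

    session is any object with a requests-style get(url, headers=..., timeout=...),
    for example requests.Session().
    """
    sleep(random_delay(rng=rng))
    return session.get(url, headers=pick_headers(rng=rng), timeout=30)
```

With requests installed, this would be called as `polite_get(requests.Session(), url)` inside the scraping loop. Rotating IPs is not shown; requests accepts a `proxies` argument for that, but which proxy service to use is outside the scope of this sketch.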

Even with all of that, you may still run into this problem. Amazon is not an easy site to scrape reliably.
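Since a block can still happen, it may help to detect it explicitly instead of parsing a page that contains no chart data. The sketch below checks for the opening words of the notice quoted in the question; that Amazon always uses this exact wording is an assumption, and `looks_blocked` is a hypothetical helper name.

```python
# Marker text taken from the block notice quoted in the question.
BLOCK_MARKER = "To discuss automated access to Amazon data"


def looks_blocked(html):
    """Return True if the page body contains the automated-access notice."""
    return BLOCK_MARKER in html
```

In the question's code this could be called on `nonfiction_page.text` before handing the page to BeautifulSoup, so the loop can stop or back off as soon as the notice appears.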

Answer 2

You can try adding a User-Agent to the request headers, like this:

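import requests

# Send an identifying User-Agent (and optionally a contact address) with the request.
headers = {
    'User-Agent': 'My User Agent 1.0',
    'From': 'personal@domain.com'  # This is another valid field
}

url = "YOURLINK"  # placeholder: replace with the page you want to fetch
req = requests.get(url, headers=headers)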

That should work.

Answer 3

After a while, I figured out the solution. It was simple: there is no "2020-01-01" chart on Amazon, so I changed the start date to "2020-01-05".
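As a rough sketch of that fix, the date loop from the question can start at 2020-01-05 and step back one week at a time. Only the 2020-01-05 date is confirmed by this answer; that every earlier chart falls exactly seven days apart on the same weekday is an assumption, and `weekly_dates` is a hypothetical helper name.

```python
import datetime


def weekly_dates(start, stop, step_days=7):
    """Yield dates from start back to stop (inclusive), step_days apart."""
    current = start
    while current >= stop:
        yield current
        current -= datetime.timedelta(days=step_days)


# Start from the date reported to work, not 2020-01-01.
dates = list(weekly_dates(datetime.date(2020, 1, 5), datetime.date(2015, 1, 1)))
```

Each value in `dates` can then be passed to `getWeekMostRead` in place of the hand-written `while` loop.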

Answer 1: score 5, 2020-03-12 06:28:06

Answer 2: score 0, 2021-02-08 17:18:05

Answer 3: score -1, accepted, 2020-03-12 08:01:38
