
Amazon blocked Python 3 scraping using bs4, requests

This code worked fine a few days ago when I ran it:

from bs4 import BeautifulSoup
import datetime
import requests

def getWeekMostRead(date):
    # Download the "most read" nonfiction chart page for the given week.
    nonfiction_page = requests.get("https://www.amazon.com/charts/"+date.isoformat()+"/mostread/nonfiction")
    # Keep a local copy of the page for inspection. Write the decoded text
    # rather than printing the raw bytes object, which would store a b'...'
    # repr with literal \r\n escapes in the .html file.
    content = "amazon"+date.isoformat()+"_nonfiction.html"
    with open(content, "w", encoding="utf-8") as nf_file:
        nf_file.write(nonfiction_page.text)

    mostRead_nonfiction = BeautifulSoup(nonfiction_page.content, features="html.parser")

    nonfiction = mostRead_nonfiction.find_all("div", class_="kc-horizontal-rank-card")

    mostread = []
    for book in nonfiction:
        # Some rank cards have no publisher element, so fall back to "".
        publisher = book.find(class_="kc-rank-card-publisher")
        mostread.append((
            book.find(class_="kc-rank-card-title").string.strip(),
            book.find(class_="kc-rank-card-author").string.strip(),
            publisher.string.strip() if publisher else "",
            book.find(class_="numeric-star-data").small.string.strip()
        ))
    return mostread

mostread = []
# Walk backwards one week at a time from 2020-01-01 to 2015-01-01.
date = datetime.date(2020,1,1)
while date >= datetime.date(2015,1,1):
    print("Scraping data from "+date.isoformat())
    mostread.extend(getWeekMostRead(date))
    date -= datetime.timedelta(7)
print("Currently saving scraped data to AmazonCharts.csv")
with open("AmazonCharts.csv", "w") as csv:
    counter = 0
    print("ID,Title,Author,Publisher,Rating", file=csv)
    for book in mostread:
        counter += 1
        print('AmazonCharts'+str(counter)+',"'+book[0]+'","'+book[1]+'","'+book[2]+'","'+book[3]+'"', file=csv)
    csv.close()

For some reason, today I tried to run it again, and this was included in the returned HTML file:

To discuss automated access to Amazon data please contact api-services-support@amazon.com.

For information about migrating to our APIs refer to our Marketplace APIs at https://developer.amazonservices.com/ref=rm_5_sv, or our Product Advertising API at https://affiliate-program.amazon.com/gp/advertising/api/detail/main.html/ref=rm_5_ac for advertising use cases.

I understand that Amazon is heavily anti-scraping (or at least that is what I read in some replies and threads). I tried to use headers and delays in the code, but it does not work. Would there be another way to try this? Or, if I should wait, how long should I wait?

As you noted, Amazon is very anti-scraping. There's an entire industry built around scraping data from Amazon, and Amazon has its own API access to sell, so it's in their best interest to stop people from freely grabbing data from their pages.

Based on your code, I suspect you made too many requests too quickly and were IP banned. When scraping sites, it's usually best to scrape responsibly by not going too fast, rotating user agents, and rotating IPs through a proxy service.
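A minimal sketch of what that might look like, assuming a pool of example user agent strings and hypothetical proxy endpoints (proxy1.example.com and proxy2.example.com stand in for whatever a real proxy service gives you):

import random
import requests

# Example user agent strings to rotate through; swap in current, real ones.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0",
]

# Hypothetical proxy endpoints; a real proxy service would supply these.
PROXIES = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
]

def polite_get(url):
    # Pick a random user agent and proxy for each request.
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    proxy = random.choice(PROXIES)
    return requests.get(url, headers=headers,
                        proxies={"http": proxy, "https": proxy})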

To seem less programmatic, you can also try randomizing request timing to seem more human.
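For example, a random pause between requests (the 3 to 10 second range here is an arbitrary choice):

import random
import time

# Sleep a random 3-10 seconds so requests don't arrive on a fixed schedule.
time.sleep(random.uniform(3, 10))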

Even with all of that, you'll still likely hit issues with this. Amazon is not an easy site to reliably scrape.

You can try adding a User-Agent to the headers of the request. Use this one:

import requests

headers = {
    'User-Agent': 'My User Agent 1.0',
    'From': 'personal@domain.com'  # This is another valid field
}

url = "YOURLINK"
req = requests.get(url, headers=headers)

Should be ok.
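To fold this into the original scraper, the only change would be passing the headers into the requests.get call inside getWeekMostRead, along these lines:

headers = {
    'User-Agent': 'My User Agent 1.0',
    'From': 'personal@domain.com'
}
nonfiction_page = requests.get(
    "https://www.amazon.com/charts/" + date.isoformat() + "/mostread/nonfiction",
    headers=headers)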

After a while, I figured out the solution. It is rather simple - there was no "2020-01-01" chart on Amazon; instead, I changed it to "2020-01-05".
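In code, that just means changing the starting date of the loop (assuming the weekly charts are anchored to that date, each following weekly step back stays valid):

date = datetime.date(2020,1,5)  # first date for which a chart actually exists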
