Why can't I scrape Amazon with BeautifulSoup?

Here is my Python code:

import urllib2
from bs4 import BeautifulSoup

page = urllib2.urlopen("http://www.amazon.com/")
soup = BeautifulSoup(page)
print soup

It works for google.com and many other websites, but it doesn't work for amazon.com.

I can open amazon.com in my browser, but the resulting "soup" is still none.

Besides, I find that it cannot scrape appannie.com either. However, rather than giving none, the code raises an error:

HTTPError: HTTP Error 503: Service Temporarily Unavailable 

So I wonder whether Amazon and App Annie block scraping.

Add a User-Agent header and it will work.

from bs4 import BeautifulSoup
import requests
url = "http://www.amazon.com/"

# add header
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36'}
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, "lxml")
print soup

You can try this:

import urllib2
from bs4 import BeautifulSoup

page = urllib2.urlopen("http://www.amazon.com/")
soup = BeautifulSoup(page)
print soup

In Python, arbitrary text is called a string, and it must be enclosed in quotes (" ").

I just ran into this and found that setting any User-Agent will work. You don't need to lie about your user agent.

response = HTTParty.get @url, headers: {'User-Agent' => 'Httparty'}
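For readers following the Python snippets in this thread, here is a rough equivalent of the Ruby line above. It is only a sketch: it assumes Python 3 (where urllib2 became urllib.request), and the agent string "my-scraper/0.1" and the helper names build_request and fetch are made up for illustration. Whether a given site accepts an honest agent string is up to that site.

```python
from urllib.request import Request, urlopen

# Hypothetical agent string: any honest identifier for your script.
USER_AGENT = "my-scraper/0.1"


def build_request(url, user_agent=USER_AGENT):
    """Return a Request that carries an explicit User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent=USER_AGENT, timeout=10):
    """Send the request and return the raw response body as bytes."""
    with urlopen(build_request(url, user_agent), timeout=timeout) as response:
        return response.read()
```

Calling fetch("http://www.amazon.com/") sends the request; the returned bytes can then be passed to BeautifulSoup as in the other answers.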

Add a header:

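import urllib2
from bs4 import BeautifulSoup

headers = {'User-agent': 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36'}

# Wrap the URL in a Request so the headers are actually sent;
# calling urlopen() with the bare URL would ignore them.
request = urllib2.Request("http://www.amazon.com/", headers=headers)
page = urllib2.urlopen(request)
soup = BeautifulSoup(page)
print soup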
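If the server still refuses the request, urllib raises the same kind of HTTPError shown in the question. The sketch below shows one way to catch it and report the status instead of crashing. It assumes Python 3 (urllib.request and urllib.error rather than urllib2), and the names describe_failure and fetch_or_report, as well as the default agent string, are hypothetical.

```python
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen


def describe_failure(error):
    """Turn a urllib error into a short human-readable message."""
    if isinstance(error, HTTPError):
        # The server answered, but with an error status such as 503.
        return "HTTP %d: %s" % (error.code, error.reason)
    # No HTTP response at all (DNS failure, refused connection, ...).
    return "Connection problem: %s" % error.reason


def fetch_or_report(url, user_agent="my-scraper/0.1", timeout=10):
    """Return the page body as bytes, or None after printing the failure."""
    request = Request(url, headers={"User-Agent": user_agent})
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.read()
    except URLError as error:  # HTTPError is a subclass of URLError
        print(describe_failure(error))
        return None
```

A None result means the request failed and the printed message says why, so the caller can skip the BeautifulSoup step for that URL.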
