
Beautiful Soup parsing an Amazon page

Hi, I'm trying to parse an Amazon page for book details, so I'm using Beautiful Soup.

link: https://www.amazon.com/Dogs-Purpose-Novel-Humans/dp/0765326264/ref=sr_1_1?s=electronics&ie=UTF8&qid=1489776209&sr=1-1&keywords=books

from bs4 import BeautifulSoup
import requests

url = raw_input("Enter a website to extract the URL's from: ")
r  = requests.get(url)

data = r.text

soup = BeautifulSoup(data, "lxml")

#Grab book details
print soup.find("table", {"id": "productDetailsTable" })

But when I try this code I get None as a result. I'm sure the id productDetailsTable exists, and when I run this code against dummy HTML it works, just not with a URL. Why?

I did not see productDetailsTable on https://www.amazon.com

I had to request https://www.amazon.com/ in order to receive the HTML data.

Here is my slightly modified Python 3 code.

from bs4 import BeautifulSoup
import requests

url = input("Enter a website to extract the URL's from: ")
r  = requests.get(url)

data = r.text

soup = BeautifulSoup(data, "lxml")

print(soup.text)

It prints the text content of the page that was returned (soup.text includes the contents of script tags, not the markup itself).

You'll notice that Amazon is smart. The returned page contains the Robot Check:

if (true === true) {
var ue_t0 = (+ new Date()),
    ue_csm = window,
    ue = { t0: ue_t0, d: function() { return (+new Date() - ue_t0); } },
    ue_furl = "fls-na.amazon.com",
    ue_mid = "ATVPDKIKX0DER",
    ue_sid = (document.cookie.match(/session-id=([0-9-]+)/) || [])[1],
    ue_sn = "opfcaptcha.amazon.com",
    ue_id = 'R8D7EEN5FVS7RWC2M549';
}
Enter the characters you see below
Sorry, we just need to make sure you're not a robot. For best results, please make sure your browser is accepting cookies.

It is keeping you from reading Amazon's page, which is why productDetailsTable is not in the HTML you receive. You'll have to do more, probably with requests, such as sending headers and cookie information.
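As a rough sketch of that idea, the snippet below sends browser-like headers through a requests.Session (which keeps cookies between requests) and checks whether the response is the Robot Check page. The names ROBOT_CHECK_MARKERS, BROWSER_HEADERS, is_robot_check, make_session and fetch_html are made up for this example, and the header values are only assumptions; nothing here guarantees that Amazon will serve the real page.

```python
# Phrases copied from the Robot Check output shown above.
ROBOT_CHECK_MARKERS = (
    "Enter the characters you see below",
    "not a robot",
)

# Assumed, illustrative browser-like headers; adjust to taste.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
}


def is_robot_check(html):
    """Return True if the page contains one of the Robot Check phrases."""
    return any(marker in html for marker in ROBOT_CHECK_MARKERS)


def make_session():
    """Create a requests.Session that sends BROWSER_HEADERS and stores cookies."""
    import requests  # imported here so is_robot_check works without requests

    session = requests.Session()
    session.headers.update(BROWSER_HEADERS)
    return session


def fetch_html(session, url, timeout=10):
    """Fetch url through the session and return the response body as text."""
    response = session.get(url, timeout=timeout)
    return response.text
```

You would call make_session() once, pass the session to fetch_html(session, url), and only hand the result to BeautifulSoup when is_robot_check(html) is False. If it is True, soup.find("table", {"id": "productDetailsTable"}) will keep returning None, because the table is simply not in that page.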
