
Beautiful Soup parsing Amazon page

Hi, I'm trying to parse an Amazon page to get book details, so I'm using Beautiful Soup.

Link: https://www.amazon.com/Dogs-Purpose-Novel-Humans/dp/0765326264/ref=sr_1_1?s=electronics&ie=UTF8&qid=1489776209&sr=1-1&keywords=books

from bs4 import BeautifulSoup
import requests

url = raw_input("Enter a website to extract the URL's from: ")
r  = requests.get(url)

data = r.text

soup = BeautifulSoup(data, "lxml")

#Grab book details
print soup.find("table", {"id": "productDetailsTable" })

However, when I run this code the result is None. I'm sure the id productDetailsTable exists, and the code works when I run it against dummy HTML. Does it not work with a URL?

I'm not seeing productDetailsTable at https://www.amazon.com

I had to use https://www.amazon.com/ to receive HTML data.

Here is my slightly modified Python 3 code.

from bs4 import BeautifulSoup
import requests

url = input("Enter a website to extract the URL's from: ")
r  = requests.get(url)

data = r.text

soup = BeautifulSoup(data, "lxml")

print(soup.text)

It prints the text of the page that was returned.

You'll notice that Amazon is smart. The HTML includes a robot check:

if (true === true) {
var ue_t0 = (+ new Date()),
    ue_csm = window,
    ue = { t0: ue_t0, d: function() { return (+new Date() - ue_t0); } },
    ue_furl = "fls-na.amazon.com",
    ue_mid = "ATVPDKIKX0DER",
    ue_sid = (document.cookie.match(/session-id=([0-9-]+)/) || [])[1],
    ue_sn = "opfcaptcha.amazon.com",
    ue_id = 'R8D7EEN5FVS7RWC2M549';
}
Enter the characters you see below
Sorry, we just need to make sure you're not a robot. For best results, please make sure your browser is accepting cookies.

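This is what prevents you from reading Amazon's pages. You will probably have to do more work with requests, such as including header and cookie information.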
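
As a rough sketch of that extra work, something like the following sends browser-like headers through a requests Session (which keeps cookies between requests) and checks whether the response is the robot-check page before looking for the table. The header values, the marker strings (copied from the output above), and the helper names is_robot_check and fetch_product_details are illustrative assumptions, not anything Amazon documents, and the request may well still be blocked.

```python
# Phrases copied from the robot-check output shown above.
# Assumption: Amazon may change this wording at any time.
ROBOT_CHECK_MARKERS = (
    "Enter the characters you see below",
    "make sure you're not a robot",
    "opfcaptcha",
)

# Hypothetical browser-like headers; the values are illustrative only.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
}


def is_robot_check(html):
    """Return True if the HTML looks like the robot-check page."""
    return any(marker in html for marker in ROBOT_CHECK_MARKERS)


def fetch_product_details(url):
    """Fetch url and return the productDetailsTable tag, or None if absent."""
    # Imported here so is_robot_check() can be used without these packages.
    import requests
    from bs4 import BeautifulSoup

    # A Session stores cookies set by the server and resends them.
    with requests.Session() as session:
        session.headers.update(BROWSER_HEADERS)
        response = session.get(url, timeout=10)

    # Check for the robot page first, since it explains a None result.
    if is_robot_check(response.text):
        raise RuntimeError("Got the robot-check page instead of the product page")
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "lxml")
    return soup.find("table", {"id": "productDetailsTable"})
```

Calling fetch_product_details(url) with the product link then either returns the table, returns None if the page has no element with that id, or raises an error that tells you the robot check was served, which is easier to debug than a bare None.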

