Beautiful Soup parsing Amazon page
Hi, I'm trying to parse an Amazon page to get book details, so I'm using Beautiful Soup.
Link: https://www.amazon.com/Dogs-Purpose-Novel-Humans/dp/0765326264/ref=sr_1_1?s=electronics&ie=UTF8&qid=1489776209&sr=1-1&keywords=books
from bs4 import BeautifulSoup
import requests
url = raw_input("Enter a website to extract the URL's from: ")
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "lxml")
#Grab book details
print soup.find("table", {"id": "productDetailsTable" })
However, when I run this code the result is None. I'm sure the id productDetailsTable exists, and the code works when I run it against dummy HTML. Does it not work with URLs?
I don't see productDetailsTable on https://www.amazon.com.
I had to use https://www.amazon.com/ to receive HTML data.
Here is my slightly modified code for Python 3:
from bs4 import BeautifulSoup
import requests
url = input("Enter a website to extract the URL's from: ")
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "lxml")
print(soup.text)
It prints the content of the page that came back.
You'll notice that Amazon is smart. The HTML contains a bot check:
if (true === true) {
var ue_t0 = (+ new Date()),
ue_csm = window,
ue = { t0: ue_t0, d: function() { return (+new Date() - ue_t0); } },
ue_furl = "fls-na.amazon.com",
ue_mid = "ATVPDKIKX0DER",
ue_sid = (document.cookie.match(/session-id=([0-9-]+)/) || [])[1],
ue_sn = "opfcaptcha.amazon.com",
ue_id = 'R8D7EEN5FVS7RWC2M549';
}
Enter the characters you see below
Sorry, we just need to make sure you're not a robot. For best results, please make sure your browser is accepting cookies.
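So the None in the question most likely comes from parsing this bot-check page rather than the real product page. As a rough sketch (not a way around the check), you could test the response for the phrases quoted above before looking for the table. The helper names looks_like_bot_check and find_details_table and the sample strings are made up for illustration, and I'm assuming the built-in "html.parser" here instead of "lxml" so the snippet has no extra dependency. Amazon may change or encode this wording, so treat the markers as a heuristic.

```python
from bs4 import BeautifulSoup

# Phrases copied from the bot-check page quoted above (heuristic only).
BOT_CHECK_MARKERS = (
    "Enter the characters you see below",
    "make sure you're not a robot",
)


def looks_like_bot_check(html):
    """Return True if the raw HTML contains one of the bot-check phrases."""
    return any(marker in html for marker in BOT_CHECK_MARKERS)


def find_details_table(html):
    """Return the productDetailsTable tag, or None if it is absent or blocked."""
    if looks_like_bot_check(html):
        return None
    # "html.parser" is an assumption; the original code used "lxml".
    soup = BeautifulSoup(html, "html.parser")
    return soup.find("table", {"id": "productDetailsTable"})


# Hypothetical sample inputs: a dummy product page and a blocked response.
dummy = (
    '<html><body><table id="productDetailsTable">'
    "<tr><td>ISBN</td></tr></table></body></html>"
)
blocked = (
    "<html><body><p>Sorry, we just need to make sure "
    "you're not a robot.</p></body></html>"
)

print(find_details_table(dummy) is not None)
print(find_details_table(blocked) is None)
```

With a real request you would pass r.text to find_details_table and treat None as "no table, or blocked", which at least tells you why soup.find came back empty.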