Python BeautifulSoup html.parser not working
I have a script to pull book information from Amazon which was running successfully before but failed today. I am not able to figure out exactly what is going wrong, but I am assuming it is parser- or JavaScript-related. I am using the code below.
from bs4 import BeautifulSoup
import requests
response = requests.get('https://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Dstripbooks&field-keywords=9780307397980',headers={'User-Agent': b'Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'})
html = response.content
soup = BeautifulSoup(html, "html.parser")
resultcol = soup.find('div', attrs={'id':'resultsCol'})
Previously I used to get data in resultcol but now it is blank. When I check html I can see the tag I am looking for, i.e. <div id="resultsCol" class='' >. But soup does not have this text in it. Can anyone help me debug this? It was working perfectly fine before but now it is not.
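A quick way to narrow this kind of failure down is to check whether the marker string appears in the raw response but not in the parsed tree: if it is in the bytes but soup.find comes back empty, the parser is losing it; if it is not in the bytes at all, the server returned a different page (for example a bot check). A minimal sketch, using a local snippet in place of the real Amazon response:

```python
from bs4 import BeautifulSoup

# Local stand-in for response.content; in the real script this is the Amazon page.
html = b'<html><body><div id="resultsCol" class="">books</div></body></html>'

# Is the marker present in the raw bytes at all?
in_raw = b'resultsCol' in html

# Does it survive parsing into the tree?
soup = BeautifulSoup(html, 'html.parser')
in_soup = soup.find('div', attrs={'id': 'resultsCol'}) is not None

print(in_raw, in_soup)  # True True for this snippet
```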
You need to wait until the page is completely loaded. You have to use PhantomJS to make sure the page is loaded correctly.
I was able to get the correct element with the following code.
from bs4 import BeautifulSoup
from selenium import webdriver
url = ("https://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3D"
       "stripbooks&field-keywords=9780307397980")
browser = webdriver.PhantomJS()
browser.get(url)
html = browser.page_source
soup = BeautifulSoup(html, 'lxml')
resultcol = soup.find('img', attrs={'class': 's-access-image'})
print(resultcol)
Remove the headers, and it should work.
from bs4 import BeautifulSoup
import requests
response = requests.get('https://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Dstripbooks&field-keywords=9780307397980')
html = response.content
soup = BeautifulSoup(html, "html.parser")
resultcol = soup.find('div', attrs={'id':'resultsCol'})
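For what it's worth, one pattern that produces exactly this symptom (the tag visible in the raw HTML but soup.find returning None) is markup wrapped in an HTML comment: BeautifulSoup keeps comment bodies as plain strings, so nothing inside them is ever parsed into tags. A minimal illustration:

```python
from bs4 import BeautifulSoup

# The marker is present in the raw bytes...
html = b'<html><body><!-- <div id="resultsCol" class=""></div> --></body></html>'
print(b'resultsCol' in html)  # True

# ...but it sits inside a comment, so it is never parsed into a tag.
soup = BeautifulSoup(html, 'html.parser')
print(soup.find('div', attrs={'id': 'resultsCol'}))  # None
```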