Python BeautifulSoup html.parser not working

I have a script that pulls book information from Amazon. It used to run successfully, but it failed today. I can't figure out exactly what is going wrong, but I assume it is related to the parser or to JavaScript. I am using the code below.

from bs4 import BeautifulSoup
import requests

response = requests.get('https://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Dstripbooks&field-keywords=9780307397980',headers={'User-Agent': b'Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'})
html = response.content
soup = BeautifulSoup(html, "html.parser")
resultcol = soup.find('div', attrs={'id':'resultsCol'})

Previously I got data in resultcol, but now it is empty. When I inspect html, I can see the tag I am looking for, i.e. <div id="resultsCol" class=''>, but soup does not contain it. Can anyone help me debug this?

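You need to wait until the page is completely loaded. requests only downloads the initial HTML and does not run JavaScript, so anything the page builds in the browser is missing from the response. You can use PhantomJS (driven through Selenium) to make sure the page is fully loaded before parsing it.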

I was able to get the correct element with the following code.

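from bs4 import BeautifulSoup
from selenium import webdriver

url = ("https://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3D"
       "stripbooks&field-keywords=9780307397980")

# Note: PhantomJS support was removed in Selenium 4, so this needs an older
# Selenium release with the phantomjs binary on PATH.
browser = webdriver.PhantomJS()
browser.get(url)
# page_source is the DOM after the browser has run the page's JavaScript
html = browser.page_source
browser.quit()

# The 'lxml' parser requires the lxml package to be installed
soup = BeautifulSoup(html, 'lxml')
resultcol = soup.find('img', attrs={'class': 's-access-image'})
print(resultcol)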

Remove the headers argument, and it should work.

from bs4 import BeautifulSoup
import requests

# Same request as in the question, but without the custom User-Agent header
response = requests.get('https://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Dstripbooks&field-keywords=9780307397980')
html = response.content
soup = BeautifulSoup(html, "html.parser")
resultcol = soup.find('div', attrs={'id':'resultsCol'})
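
To tell the explanations in this thread apart, it may help to check whether the id is present in the raw markup at all, and whether a parser actually sees it as a tag attribute. The following is only a minimal sketch using the standard library's html.parser module (the module that BeautifulSoup's "html.parser" option builds on), so BeautifulSoup's own tree may still differ. The names IdCollector, diagnose, and sample are hypothetical and not part of the original script; sample stands in for response.content.

```python
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Collect the id attribute of every start tag the parser sees."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag
        for name, value in attrs:
            if name == "id" and value:
                self.ids.append(value)


def diagnose(html_bytes, element_id, encoding="utf-8"):
    """Report whether element_id is in the raw markup and in the parsed tags."""
    text = html_bytes.decode(encoding, errors="replace")
    collector = IdCollector()
    collector.feed(text)
    collector.close()
    return {
        "in_raw_html": element_id in text,
        "seen_by_parser": element_id in collector.ids,
    }


# Hypothetical sample standing in for response.content
sample = b"<html><body><div id=\"resultsCol\" class=''><p>hit</p></div></body></html>"
print(diagnose(sample, "resultsCol"))
```

If in_raw_html is False, the server sent a different page (for example because of the headers, or because the content is built by JavaScript). If it is True but seen_by_parser is False, the text only appears somewhere the parser does not treat as a tag, such as inside a script block, or the markup around it is confusing the parser; in that case trying another parser such as lxml is a reasonable next step.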
