Python BeautifulSoup html.parser not working

Question

I have a script that pulls book information from Amazon. It used to run successfully, but it failed today. I can't figure out exactly what is going wrong, but I assume it is related to the parser or to JavaScript. I am using the code below.

from bs4 import BeautifulSoup
import requests

response = requests.get('https://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Dstripbooks&field-keywords=9780307397980',headers={'User-Agent': b'Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'})
html = response.content
soup = BeautifulSoup(html, "html.parser")
resultcol = soup.find('div', attrs={'id':'resultsCol'})

Previously I got data in resultcol, but now it is empty. When I inspect html, I can see the tag I am looking for, i.e. <div id="resultsCol" class=''>, but soup does not contain it. Can anyone help me debug this? It was working perfectly before, but now it is not.
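One way to narrow this down is to check separately whether the server actually sent the tag and whether the tokenizer can see it. The sketch below uses only Python's standard html.parser module, which is what BeautifulSoup's "html.parser" option is built on. The names IdFinder and diagnose are hypothetical helpers, not part of any library, and the substring check assumes the id is written with double quotes, as in the tag quoted above.

```python
from html.parser import HTMLParser


class IdFinder(HTMLParser):
    """Remembers whether a start tag with the wanted id attribute was seen."""

    def __init__(self, wanted_tag, wanted_id):
        super().__init__()
        self.wanted_tag = wanted_tag
        self.wanted_id = wanted_id
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == self.wanted_tag and dict(attrs).get("id") == self.wanted_id:
            self.found = True


def diagnose(raw_html, tag="div", wanted_id="resultsCol"):
    """Report where the element goes missing, given the raw response bytes."""
    # Step 1: plain substring check on the bytes the server returned.
    in_raw = ('id="%s"' % wanted_id).encode("utf-8") in raw_html

    # Step 2: feed the same markup through the standard-library tokenizer.
    finder = IdFinder(tag, wanted_id)
    finder.feed(raw_html.decode("utf-8", errors="replace"))
    finder.close()

    return {"in_raw_bytes": in_raw, "seen_by_html_parser": finder.found}
```

Calling diagnose(response.content) on the response from the question gives two flags. If in_raw_bytes is False, the server probably returned a different page and no parser will help. If both flags are True but soup.find still returns None, the element is likely being lost while the tree is built, and trying a different parser is a reasonable next step.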

Answer 1

You need to wait until the page has completely loaded. You need to use PhantomJS to make sure the page is loaded correctly.

I was able to get the correct element with the following code.

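from bs4 import BeautifulSoup
from selenium import webdriver

url = ("https://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3D"
       "stripbooks&field-keywords=9780307397980")

# PhantomJS is a headless browser, so the page's JavaScript runs before the HTML is read.
# webdriver.PhantomJS is only available in older Selenium releases (it was removed in Selenium 4).
browser = webdriver.PhantomJS()
browser.get(url)
html = browser.page_source
browser.quit()

# The 'lxml' parser requires the lxml package to be installed.
soup = BeautifulSoup(html, 'lxml')
resultcol = soup.find('img', attrs={'class': 's-access-image'})
print(resultcol)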
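The idea of waiting until the page has loaded can be made explicit with a small polling loop. This is a minimal, library-free sketch of that idea; wait_until is a hypothetical helper written for illustration, and the timeout and interval defaults are arbitrary assumptions. Selenium also ships its own explicit-wait support, which is usually the better choice when a driver is already in use.

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.5):
    """Call predicate repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed without a truthy result.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f seconds" % timeout)
        # Sleep briefly so the loop does not spin at full speed.
        time.sleep(interval)
```

With the browser object from the code above, one possible condition would be wait_until(lambda: 'resultsCol' in browser.page_source), called before reading page_source and before quitting the browser.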

Answer 2

Remove the headers, and it should work.

from bs4 import BeautifulSoup
import requests

# Same request as in the question, but without the custom User-Agent header.
response = requests.get('https://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Dstripbooks&field-keywords=9780307397980')
html = response.content
soup = BeautifulSoup(html, "html.parser")
resultcol = soup.find('div', attrs={'id':'resultsCol'})

2 answers

Answer 1
1 2018-09-12 23:18:59

Answer 2
0 2018-09-12 23:39:35
