
Errors with Beautiful Soup output

I'm trying to scrape data from a GameSpot webpage using BeautifulSoup. However, the result is very different from what I see in the page source viewer. First off, a lot of errors are produced. For instance, we have

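import bs4
import requests

r = requests.get(link)
# Name the parser explicitly so the result does not depend on what is installed
soup = bs4.BeautifulSoup(r.text, 'html.parser')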

And yet soup.title gives

<title>404: Not Found - GameSpot</title>

The data I actually want to scrape does not even appear. Is it because the webpage also contains JavaScript? If so, how can I get around this?
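The title in the output above is the site's own error page, which suggests the request may have received a 404 response rather than the expected page. Below is a minimal sketch of checking for that before parsing. It uses only the standard library so it runs offline; the helper names `page_title` and `is_error_page` and the sample HTML are hypothetical. With requests, the status code would come from `r.status_code`.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def page_title(html):
    """Return the stripped text of the first <title>, or '' if there is none."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()


def is_error_page(status_code, html):
    """True when the status is 4xx/5xx or the title starts with '404'."""
    return status_code >= 400 or page_title(html).startswith("404")


# Hypothetical sample mirroring the title shown in the question
sample = "<html><head><title>404: Not Found - GameSpot</title></head></html>"
print(page_title(sample))
print(is_error_page(404, sample))
```

If this check fires, the missing data is more likely explained by the response itself than by the parser.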

You're only sending an HTTP request to the server. You need to process the JavaScript to get the content.

A headless browser with JavaScript support, like Ghost, would be a good choice.

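from bs4 import BeautifulSoup
from ghost import Ghost

ghost = Ghost()
ghost.open(link)
# evaluate() runs the script in the loaded page; the rendered HTML comes back as `page`
page, resources = ghost.evaluate('document.documentElement.innerHTML;')
soup = BeautifulSoup(page, 'html.parser')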

.evaluate('document.documentElement.innerHTML') will show the dynamically generated content, not the static content you'd see by looking at the page source.

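Your connection error is: socket.error: [Errno 54] Connection reset by peer. The first time you connect to http://www.gamespot.com, you must capture the cookie and send it in the headers of your requests for the other pages.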
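The cookie handling described above can be sketched as follows. This is an illustration under assumptions, not a confirmed fix for the reset: it uses the standard library's `http.cookiejar` instead of requests, and the helper names `make_cookie_opener` and `fetch` are hypothetical. With requests, a `requests.Session` object plays the same role, since it keeps cookies between calls.

```python
import http.cookiejar
import urllib.request


def make_cookie_opener():
    """Return (opener, jar). The opener stores cookies it receives in jar
    and sends them back on later requests made through the same opener."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    return opener, jar


def fetch(opener, url):
    """Fetch url through the opener and return the decoded body.

    Note: urllib raises urllib.error.HTTPError for 4xx/5xx responses.
    """
    with opener.open(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")
```

To use it, call `make_cookie_opener()` once, fetch http://www.gamespot.com through the returned opener so that any cookies land in the jar, then fetch the target page through the same opener.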
