Beautiful Soup not returning everything in HTML file?

HTML noob here, so I may be misunderstanding something about the HTML document; bear with me.

I'm using Beautiful Soup to parse web data in Python. Here is my code:

import urllib
import BeautifulSoup

url = "http://www.nba.com/gameline/20160323/"
page = urllib.urlopen(url).read()
soup = BeautifulSoup.BeautifulSoup(page)
indicateGameDone = str(soup.find("div", {"class": "nbaModTopStatus"}))
print indicateGameDone

Now, if you look at the website, the HTML has the line <p class="nbaLiveStatTxSm"> FINAL </p> (inspect the 'Final' text on the left side of the container for the first ATL-WAS game on the page to see it for yourself). But when I run the code above, it doesn't return the 'FINAL' that is shown on the webpage; instead, the nbaLiveStatTxSm element is empty.

On my machine, this is the output when I print indicateGameDone:

<div class="nbaModTopStatus"><p class="nbaLiveStatTx">Live</p><p class="nbaLiveStatTxSm"></p><p class="nbaFnlStatTx">Final</p><p class="nbaFnlStatTxSm"></p></div>

Does anyone know why this is happening?

EDIT: Clarification: the problem isn't retrieving the text within the tag. The problem is that when I fetch the HTML from the website and print it in Python, something I saw when inspecting the element in the browser is missing from the printed output.

You can use this logic to extract any text. This code lets you extract the data between any tags. Output: FINAL

import urllib
from bs4 import BeautifulSoup

url = "http://www.nba.com/gameline/20160323/"
page = urllib.urlopen(url)
soup = BeautifulSoup(page)
# The status text sits in a <p>, not a <div>, so look the <p> up directly.
p_text = soup.find("p", {"class": "nbaFnlStatTxSm"})
# find() returns None when nothing matches, so guard before reading the text.
if p_text is not None:
    print(p_text.getText())

It looks like your problem is not with BeautifulSoup but with urllib.

Try running the following commands:

>>> import urllib
>>> url = "http://www.nba.com/gameline/20160323/"
>>> page = urllib.urlopen(url).read()
>>> page.find('<div class="nbaModTopStatus">')
44230

Which is no surprise, considering that Beautiful Soup was able to find the div itself. However, when we look a little deeper into what urllib is actually collecting, we can see that the <p class="nbaFnlStatTxSm"> is empty by running:

>>> page[44230:45000]
'<div class="nbaModTopStatus"><p class="nbaLiveStatTx">Live</p><p class="nbaLiveStatTxSm"></p><p class="nbaFnlStatTx">Final</p><p class="nbaFnlStatTxSm"></p></div><div id="nbaGLBroadcast"><a href="/leaguepass"><img src="/.element/img/3.0/sect/gameline/broadcasters/lp.png"></a></div><div class="nbaTeamsRow"><div class="nbaModTopTeamScr nbaModTopTeamAw"><h5 class="nbaModTopTeamName awayteam">ATL</h5><img src="http://i.cdn.turner.com/nba/nba/.element/img/2.0/sect/gameline/teams/ATL.gif" width="34" height="22" title="Atlanta Hawks"><h4 class="nbaModTopTeamNum  win"></h4></div><div class="nbaModTopTeamScr nbaModTopTeamHm"><h5 class="nbaModTopTeamName hometeam">WAS</h5><img src="http://i.cdn.turner.com/nba/nba/.element/img/2.0/sect/gameline/teams/WAS.gif" width="34" '

You can see that the tag is empty, so your problem is the data that's being passed to Beautiful Soup, not the package itself.
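If you want to double-check the raw payload without involving Beautiful Soup at all, here is a minimal sketch that lists which <p> tags come back empty. It assumes Python 3 and uses only the standard library's html.parser; the PTextCollector class and the p_texts function are names made up for illustration, and the input is the snippet from the output above rather than a live request.

```python
from html.parser import HTMLParser

# Snippet copied from the output shown above; no network access is needed.
RAW = ('<div class="nbaModTopStatus"><p class="nbaLiveStatTx">Live</p>'
       '<p class="nbaLiveStatTxSm"></p><p class="nbaFnlStatTx">Final</p>'
       '<p class="nbaFnlStatTxSm"></p></div>')


class PTextCollector(HTMLParser):
    """Hypothetical helper: maps each <p> class to the text inside it."""

    def __init__(self):
        super().__init__()
        self.texts = {}        # class name -> accumulated text
        self._current = None   # class of the <p> currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._current = dict(attrs).get("class", "")
            self.texts[self._current] = ""

    def handle_data(self, data):
        if self._current is not None:
            self.texts[self._current] += data

    def handle_endtag(self, tag):
        if tag == "p":
            self._current = None


def p_texts(html):
    """Return {class: text} for every <p> found in the given HTML string."""
    collector = PTextCollector()
    collector.feed(html)
    collector.close()
    return collector.texts


texts = p_texts(RAW)
# Classes whose <p> carries no visible text in the downloaded HTML.
empty = [cls for cls, text in texts.items() if not text.strip()]
print(empty)
```

If the two "Sm" classes show up in the list, the text was never in the downloaded HTML in the first place, so no parser can recover it from that string.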

  1. Changed the import of BeautifulSoup to the proper syntax for the current version of BeautifulSoup.
  2. Corrected the way you were constructing the BeautifulSoup object.
  3. Fixed your find statement, then used the .text attribute to get the string representation of the text in the HTML you're after.

With the minor modifications listed above, your code runs for me.

import urllib
from bs4 import BeautifulSoup

url = "http://www.nba.com/gameline/20160323/"
page = urllib.urlopen(url).read()
soup = BeautifulSoup(page)
indicateGameDone = soup.find("div", {"class": "nbaModTopStatus"})
print indicateGameDone.text ## "LiveFinal "

To address the comments:

import urllib
from bs4 import BeautifulSoup

url = "http://www.nba.com/gameline/20160323/"
page = urllib.urlopen(url).read()
soup = BeautifulSoup(page)
indicateGameDone = soup.find("p", {"class": "nbaFnlStatTx"})
print indicateGameDone.text
