BeautifulSoup returns incorrect text

Question

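I am trying to scrape live tennis scores from the website below. Once a match has finished, my scraper picks up the change and gets the score. During a match, however, when I search for the 'span' class that holds the score, I get the elements back but the score is blank (see below).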

http://www.scoreboard.com/game/6LeqhPJd/#game-summary

score = score.findAll('span',attrs={'class':'scoreboard'})

Output:

[<span class="scoreboard">-</span>, <span class="scoreboard">-</span>]

Expected output:

[<span class="scoreboard">1</span>, <span class="scoreboard">0</span>]

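Using Firebug I can see the scores in these fields, but I can't seem to retrieve them. Does anyone know why this happens?

Note: once the match at the URL above has finished, the score elements will change. This is only a problem for live matches.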

Answer 1

The web page uses JavaScript. If you download the URL with urllib, the JavaScript is not executed, so much of the HTML you see in the browser is never generated.
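To make this concrete, here is a minimal sketch using only the Python 3 standard library, so it does not rely on BeautifulSoup itself. The two markup strings are made-up stand-ins, not the real page: `served` imitates what the server might send, and `rendered` imitates what the browser holds after its scripts have run. The `ScoreSpans` class and the `score_texts` helper are hypothetical names introduced for illustration.

```python
from html.parser import HTMLParser


class ScoreSpans(HTMLParser):
    """Collect the text of every <span class="scoreboard"> element."""

    def __init__(self):
        super().__init__()
        self.in_score = False
        self.scores = []

    def handle_starttag(self, tag, attrs):
        if tag == 'span' and dict(attrs).get('class') == 'scoreboard':
            self.in_score = True
            self.scores.append('')

    def handle_data(self, data):
        # Only keep text that sits inside a matching span.
        if self.in_score:
            self.scores[-1] += data

    def handle_endtag(self, tag):
        if tag == 'span':
            self.in_score = False


def score_texts(markup):
    parser = ScoreSpans()
    parser.feed(markup)
    parser.close()
    return parser.scores


# Assumed markup, for illustration only.
served = ('<span class="scoreboard">-</span>'
          '<span class="scoreboard">-</span>'
          '<script>/* would fill in the scores in a browser */</script>')
rendered = ('<span class="scoreboard">1</span>'
            '<span class="scoreboard">0</span>')

print(score_texts(served))    # placeholders, as in the question
print(score_texts(rendered))  # scores, once scripts have run
```

The parser reports whatever text is in the markup it is given. The placeholders come from the download step, not from the parsing step.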

One way to execute the JavaScript is to use Selenium. Another is to use PyQt4:

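import sys
from PyQt4 import QtWebKit
from PyQt4 import QtCore
from PyQt4 import QtGui

class Render(QtWebKit.QWebPage):
    """Load a URL in a headless QtWebKit page and keep the rendered frame."""
    def __init__(self, url):
        # A QApplication must exist before any QWebPage is created.
        self.app = QtGui.QApplication(sys.argv)
        QtWebKit.QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QtCore.QUrl(url))
        # Run the event loop until _loadFinished calls quit().
        self.app.exec_()

    def _loadFinished(self, result):
        # Keep the frame so its HTML can be read after the event loop exits.
        self.frame = self.mainFrame()
        self.app.quit()

url = 'http://www.scoreboard.com/game/6LeqhPJd/#game-summary'
r = Render(url)
# Python 2: convert the HTML returned by toHtml() to a unicode string.
content = unicode(r.frame.toHtml())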

Once you have content (the HTML after the JavaScript has run), you can parse it with an HTML parser such as BeautifulSoup or lxml.

For example, using lxml:

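import lxml.html as LH

def clean(text):
    # Strip non-breaking spaces from the cell text.
    return text.replace(u'\xa0', u'')

doc = LH.fromstring(content)
result = []
# Select each table row that contains a player-name cell.
for tr in doc.xpath('//tr[td[@class="left summary-horizontal"]]'):
    row = []
    for elt in tr.xpath('td'):
        row.append(clean(elt.text_content()))
    # Skip the first cell; join the rest into one line per player.
    result.append(u', '.join(row[1:]))
print(u'\n'.join(result))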

Output:

Chardy J. (Fra), 2, 6, 77, , , , 
Zeballos H. (Arg), 0, 4, 63, , , ,
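The trailing empty fields in that output are worth a quick illustration. The sketch below (Python 3, standard library only) assumes that the cells for sets not yet played contain only a non-breaking space, written `&nbsp;` in HTML and decoded to U+00A0. The list `raw_cells` is a hypothetical stand-in for one table row, not data taken from the page.

```python
import html


def clean(text):
    # Same idea as the clean() helper above: drop non-breaking spaces.
    return text.replace('\xa0', '')


# Assumed cell contents for one row; the real page's cells may differ.
raw_cells = ['', 'Chardy J. (Fra)', '2', '6', '77',
             '&nbsp;', '&nbsp;', '&nbsp;', '&nbsp;']

# html.unescape turns '&nbsp;' into U+00A0, which clean() then removes.
cells = [clean(html.unescape(cell)) for cell in raw_cells]

# As in the answer's code, skip the first cell and join the rest.
line = ', '.join(cells[1:])
print(repr(line))
```

Under that assumption, each unplayed set becomes an empty string, which is why the joined line ends in a run of bare commas.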

Using Selenium and PhantomJS (so that no GUI browser pops up), the equivalent code looks like this:

import selenium.webdriver as webdriver
import contextlib
import os
import lxml.html as LH

# define path to the phantomjs binary
phantomjs = os.path.expanduser('~/bin/phantomjs')
url = 'http://www.scoreboard.com/game/6LeqhPJd/#game-summary'
# contextlib.closing calls driver.close() when the block exits.
with contextlib.closing(webdriver.PhantomJS(phantomjs)) as driver:
    driver.get(url)
    # page_source holds the HTML after the page's JavaScript has run.
    content = driver.page_source
    doc = LH.fromstring(content)
    result = []
    for tr in doc.xpath('//tr[td[@class="left summary-horizontal"]]'):
        row = []
        for elt in tr.xpath('td'):
            row.append(elt.text_content())
        result.append(u', '.join(row[1:]))
    print(u'\n'.join(result))

The Selenium/PhantomJS solution and the PyQt4 solution take roughly the same time to run.

Answer 1
6, accepted, 2013-05-05 16:43:02
