XPath doesn't point to the right HTML table elements with iterator (Scrapy)

I'm having trouble using XPath to select HTML elements from a table with Scrapy. The example I'm working from is the very basic one on the Scrapy site: http://doc.scrapy.org/en/latest/intro/tutorial.html, and the page I want to parse is http://www.euroleague.net/main/results/showgame?gamecode=5&gamenumber=1&phasetypecode=RS&seasoncode=E2013#!playbyplay

Initially, I used the following code:

from basketbase.items import BasketbaseItem
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.http import HtmlResponse


class Basketspider(CrawlSpider):
    name = "playbyplay"
    download_delay = 0.5

    allowed_domains = ["www.euroleague.net"]
    start_urls = ["http://www.euroleague.net/main/results/showgame?gamenumber=1&phasetypecode=RS&gamecode=4&seasoncode=E2013"]
    rules = (
        Rule(SgmlLinkExtractor(allow=(),),callback='parse_item',),        
    )  


    def parse(self,response):
        response = HtmlResponse(url=response.url, status=response.status, headers=response.headers, body=response.body)
        return super(Basketspider,self).parse(response)

    def parse_item(self, response):
        response = HtmlResponse(url=response.url, status=response.status, headers=response.headers, body=response.body)
        sel = HtmlXPathSelector(response)

        items=[]
        item = BasketbaseItem()         
        item['game_time'] = sel.select('//div[@style="overflow: auto; height: 250px; width: 800px;"]/table/tbody/tr/td[1]/text()').extract() #
        item['game_event'] = sel.select('//div[@style="overflow: auto; height: 250px; width: 800px;"]/table/tbody/tr/td[2]/text()').extract() #
        item['game_event_res_home'] = sel.select('//div[@style="overflow: auto; height: 250px; width: 800px;"]/table/tbody/tr/td[3]/text()').extract() #
        item['game_event_res_visitor'] = sel.select('//div[@style="overflow: auto; height: 250px; width: 800px;"]/table/tbody/tr/td[3]/text()').extract() #
        item['game_event_team'] = sel.select('//div[@style="overflow: auto; height: 250px; width: 800px;"]/table/tbody/tr/td[4]/text()').extract() #
        item['game_event_player'] = sel.select('//div[@style="overflow: auto; height: 250px; width: 800px;"]/table/tbody/tr/td[5]/text()').extract() #          
        items.append(item)



        return items

This is very basic, and the rules aren't quite right yet, but the main problem with this example is the XPath.

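It works, but not the way I want. I want each item to hold the td values of a single tr, but this code extracts all the td elements into one item at once. For example, the field game_event_res_visitor: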

'game_event_res_visitor': [u'0-0',
                           u'0-0',
                           u'0-0',.......(list goes on and on)

To get the result I want, I decided to use a loop (as in the Scrapy tutorial, http://doc.scrapy.org/en/latest/intro/tutorial.html), but it doesn't return any values at all. Here is the code:

def parse(self,response):
    response = HtmlResponse(url=response.url, status=response.status, headers=response.headers, body=response.body)
    return super(Basketspider,self).parse(response)

def parse_item(self, response):
    response = HtmlResponse(url=response.url, status=response.status, headers=response.headers, body=response.body)
    sel = HtmlXPathSelector(response)
    sites = sel.xpath('//div[@style="overflow: auto; height: 250px; width: 800px;"]/table/tbody/tr')        
    items=[]
    item = BasketbaseItem()
    for site in sites:

        item = BasketbaseItem()
        item['game_time'] = sel.select('td[1]/text()').extract() #
        item['game_event'] = sel.select('td[2]/text()').extract() #
        item['game_event_res_home'] = sel.select('td[3]/text()').extract() #
        item['game_event_res_visitor'] = sel.select('td[3]/text()').extract() #
        item['game_event_team'] = sel.select('td[4]/text()').extract() #
        item['game_event_player'] = sel.select('td[5]/text()').extract() #          
        items.append(item)



    return items

And the terminal output:

2014-03-07 16:57:45+0200 [playbyplay] DEBUG: Scraped from <200 http://www.euroleague.net/main/results/showgame?gamecode=9&gamenumber=1&phasetypecode=RS&seasoncode=E2013>
    {'game_event': [],
     'game_event_player': [],
     'game_event_res_home': [],
     'game_event_res_visitor': [],
     'game_event_team': [],
     'game_time': []}
2014-03-07 16:57:45+0200 [playbyplay] DEBUG: Scraped from <200 http://www.euroleague.net/main/results/showgame?gamecode=9&gamenumber=1&phasetypecode=RS&seasoncode=E2013>
    {'game_event': [],
     'game_event_player': [],
     'game_event_res_home': [],
     'game_event_res_visitor': [],
     'game_event_team': [],
     'game_time': []}

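I know something is wrong with my XPath, but I can't figure out what. If I use a relative XPath in the item fields, I get the same result as in the first example. So that works after a fashion, but I can't get per-row items with the code I have. I even tried a "wildcard":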

    item['game_time'] = sel.select('*/text()').extract() #
    item['game_event'] = sel.select('*/text()').extract() #
    item['game_event_res_home'] = sel.select('*/text()').extract() #
    item['game_event_res_visitor'] = sel.select('*/text()').extract() #
    item['game_event_team'] = sel.select('*/text()').extract() #
    item['game_event_player'] = sel.select('*/text()').extract() #  

It doesn't return any real text, only whitespace:

2014-03-07 19:11:14+0200 [playbyplay] DEBUG: Scraped from <200 http://www.euroleague.net/main/results/showgame?gamecode=7&gamenumber=1&phasetypecode=RS&seasoncode=E2013>
    {'game_event': [u' \r\n', u'\r\n'],
     'game_event_player': [u' \r\n', u'\r\n'],
     'game_event_res_home': [u' \r\n', u'\r\n'],
     'game_event_res_visitor': [u' \r\n', u'\r\n'],
     'game_event_team': [u' \r\n', u'\r\n'],
     'game_time': [u' \r\n', u'\r\n']}

I'm confused; I don't understand what is wrong with my XPath or my code.

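This worked for me. The key change is calling select on each row rather than on sel, so that a relative XPath such as td[1] is evaluated against that row instead of the whole document: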

def parse_item(self, response):
    response = HtmlResponse(url=response.url, status=response.status, headers=response.headers, body=response.body)
    sel = HtmlXPathSelector(response)

    rows = sel.select('//div[@style="overflow: auto; height: 250px; width: 800px;"]/table/tbody/tr')
    for row in rows:
        item = BasketbaseItem()
        item['game_time'] = row.select("td[1]/text()").extract()[0]
        item['game_event'] = row.select("td[2]/text()").extract()[0]
        result = row.select("td[3]/text()").extract()[0]
        item['game_event_res_home'], item['game_event_res_visitor'] = result.split('-')
        item['game_event_team'] = row.select("td[4]/text()").extract()[0]
        item['game_event_player'] = row.select("td[5]/text()").extract()[0]
        yield item

Here is a sample item I got:

{'game_event': u'Steal',
 'game_event_player': u'DJEDOVIC, NIHAD',
 'game_event_res_home': u'0 ',
 'game_event_res_visitor': u' 0',
 'game_event_team': u'FC Bayern Munich',
 'game_time': u'2'}

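This is only a starting point for you: sometimes an item fails because of an IndexError exception (when a row has no matching cell, extract() returns an empty list), so handle that properly.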
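
One possible way to handle those cases is sketched below. The helper names first_text and split_score are hypothetical (they are not part of Scrapy), and the score format "home - visitor" is an assumption based on the sample item above, where the two halves come back as u'0 ' and u' 0'.

```python
def first_text(values, default=None):
    """Return the first extracted string, stripped, or `default` if the list is empty."""
    return values[0].strip() if values else default


def split_score(result, default=(None, None)):
    """Split a 'home - visitor' score string into two stripped parts.

    Returns `default` when the cell is missing or has no '-' separator.
    """
    if not result or "-" not in result:
        return default
    home, visitor = result.split("-", 1)
    return home.strip(), visitor.strip()


# Intended use inside the row loop (row.select(...).extract() returns a list):
#   item['game_time'] = first_text(row.select("td[1]/text()").extract())
#   home, visitor = split_score(first_text(row.select("td[3]/text()").extract()))
```

With these helpers a row that lacks a cell yields None for that field instead of raising, and the score halves no longer carry stray spaces. Whether to keep or drop such incomplete rows is a separate decision.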

Hope that helps.
