Stuck scraping a specific table

Question

So, the table I'm trying to scrape can be found here: http://www.betdistrict.com/tipsters

I'm after the table titled "Jun Stats".

Here is my spider:

from __future__ import division
from decimal import *

import scrapy
import urlparse

from ttscrape.items import TtscrapeItem 

class BetdistrictSpider(scrapy.Spider):
    name = "betdistrict"
    allowed_domains = ["betdistrict.com"]
    start_urls = ["http://www.betdistrict.com/tipsters"]

    def parse(self, response):
        for sel in response.xpath('//table[1]/tr'):
            item = TtscrapeItem()
            name = sel.xpath('td[@class="tipst"]/a/text()').extract()[0]
            url = sel.xpath('td[@class="tipst"]/a/@href').extract()[0]
            tipster = '<a href="' + url + '" target="_blank" rel="nofollow">' + name + '</a>'
            item['Tipster'] = tipster
            won = sel.xpath('td[2]/text()').extract()[0]
            lost = sel.xpath('td[3]/text()').extract()[0]
            void = sel.xpath('td[4]/text()').extract()[0]
            tips = int(won) + int(void) + int(lost)
            item['Tips'] = tips
            strike = Decimal(int(won) / tips) * 100
            strike = str(round(strike,2))
            item['Strike'] = [strike + "%"]
            profit = sel.xpath('//td[5]/text()').extract()[0]
            if profit[0] in ['+']:
                profit = profit[1:]
            item['Profit'] = profit
            yield_str = sel.xpath('//td[6]/text()').extract()[0]
            yield_str = yield_str.replace(' ','')
            if yield_str[0] in ['+']:
                yield_str = yield_str[1:]
            item['Yield'] = '<span style="color: #40AA40">' + yield_str + '%</span>'
            item['Site'] = 'Bet District'
            yield item

This gives me a "list index out of range" error on the first variable (name).

However, when I rewrite my XPath selectors to start with //, e.g.:

name = sel.xpath('//td[@class="tipst"]/a/text()').extract()[0]

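the spider runs, but it scrapes the first tipster over and over.

I think it has something to do with the table having no thead, and instead containing th tags inside the first tr of the tbody.

Any help is much appreciated.

- - - - - EDIT - - - - -

In response to Lars's suggestion:

I tried using your suggestion, but I still get a "list index out of range" error: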

from __future__ import division
from decimal import *

import scrapy
import urlparse

from ttscrape.items import TtscrapeItem 

class BetdistrictSpider(scrapy.Spider):
    name = "betdistrict"
    allowed_domains = ["betdistrict.com"]
    start_urls = ["http://www.betdistrict.com/tipsters"]

    def parse(self, response):
        for sel in response.xpath('//table[1]/tr[td[@class="tipst"]]'):
            item = TtscrapeItem()
            name = sel.xpath('a/text()').extract()[0]
            url = sel.xpath('a/@href').extract()[0]
            tipster = '<a href="' + url + '" target="_blank" rel="nofollow">' + name + '</a>'
            item['Tipster'] = tipster
            yield item

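Also, I'm assuming that doing it this way would require multiple for loops, since not all of the cells have the same class?

I have also tried doing it without a for loop, but in that case it again scrapes only the first tipster, multiple times.

Thanks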

Answer 1

When you say

name = sel.xpath('td[@class="tipst"]/a/text()').extract()[0]

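the XPath expression starts with td, so it is evaluated relative to the context node held in the variable sel (that is, the current tr element out of the set of tr elements that the for loop iterates over).

But when you say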

name = sel.xpath('//td[@class="tipst"]/a/text()').extract()[0]

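the XPath expression starts with //td, which selects all td elements anywhere in the document. This is not relative to sel, so the result is the same on every iteration of the for loop. That is why it scrapes the first tipster over and over.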
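To make the difference concrete, here is a minimal sketch that uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy selectors. The markup and the tipster names are made up for illustration, and ElementTree's `.//td` searched from the table root plays the role of Scrapy's `//td`; it is not the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: one header row and two data rows.
HTML = """
<table>
  <tr><th>Tipster</th><th>Won</th></tr>
  <tr><td class="tipst"><a href="/t/alice">Alice</a></td><td>5</td></tr>
  <tr><td class="tipst"><a href="/t/bob">Bob</a></td><td>3</td></tr>
</table>
"""

table = ET.fromstring(HTML.strip())
rows = table.findall("tr")[1:]  # skip the header row for this demo

# Document-wide lookup: ignores the current row, so the first hit never changes.
document_wide = [table.findall(".//td[@class='tipst']/a")[0].text for _ in rows]

# Row-relative lookup: evaluated against each row in turn.
row_relative = [row.findall("td[@class='tipst']/a")[0].text for row in rows]

print(document_wide)  # ['Alice', 'Alice']
print(row_relative)   # ['Alice', 'Bob']
```

The same reasoning applies to the `//td[5]` and `//td[6]` lookups for profit and yield in the original spider: they are also document-wide, so they would return the first row's values on every iteration.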

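Why does the first XPath expression fail with a "list index out of range" error? Try building up the XPath expression one location step at a time, printing the result at each step, and you will soon find the problem. In this case, it appears to be because the first tr child of table[1] has no td children (only th). So xpath() selects nothing, extract() returns an empty list, and you then try to reference the first item of that empty list, which raises the "list index out of range" error.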
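The step-by-step check described above might look roughly like this. It is again a sketch with xml.etree.ElementTree and invented markup rather than the live page; in a real Scrapy session you would run the equivalent response.xpath() calls one at a time.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: the first row holds th cells, not td cells.
HTML = """
<table>
  <tr><th>Tipster</th><th>Won</th></tr>
  <tr><td class="tipst"><a href="/t/alice">Alice</a></td><td>5</td></tr>
  <tr><td class="tipst"><a href="/t/bob">Bob</a></td><td>3</td></tr>
</table>
"""

table = ET.fromstring(HTML.strip())

# Step 1: which rows does the loop expression match?
rows = table.findall("tr")

# Step 2: which child tags does each row actually have?
child_tags = [[child.tag for child in row] for row in rows]
print(child_tags)  # [['th', 'th'], ['td', 'td'], ['td', 'td']]

# Step 3: how many hits does the name lookup produce per row?
name_hits = [len(row.findall("td[@class='tipst']/a")) for row in rows]
print(name_hits)  # [0, 1, 1]

# Indexing the empty result for the header row is what raises the error.
try:
    header_name = rows[0].findall("td[@class='tipst']/a")[0].text
except IndexError:
    header_name = None
print(header_name)  # None
```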

To fix this, you could change the for loop's XPath expression so that it loops only over tr elements that have a td child:

for sel in response.xpath('//table[1]/tr[td]'):

Better still, you could require a td of the right class:

for sel in response.xpath('//table[1]/tr[td[@class="tipst"]]'):
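Note that filtering the rows only changes which tr elements the loop visits. Inside the loop, sel is still a tr, so a relative path such as a/text() looks for an a element that is a direct child of the tr and finds nothing; the path still has to go through the cell, as in td[@class="tipst"]/a/text(). A rough sketch of the combined idea, using xml.etree.ElementTree and invented markup as a stand-in for the Scrapy selectors (ElementTree supports the tr[td] predicate but not the nested tr[td[@class="tipst"]] form):

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup standing in for the real table.
HTML = """
<table>
  <tr><th>Tipster</th><th>Won</th></tr>
  <tr><td class="tipst"><a href="/t/alice">Alice</a></td><td>5</td></tr>
  <tr><td class="tipst"><a href="/t/bob">Bob</a></td><td>3</td></tr>
</table>
"""

table = ET.fromstring(HTML.strip())
data_rows = table.findall("tr[td]")  # only rows that have a td child

# "a" relative to a row finds nothing: the link is a grandchild, not a child.
direct_links = [row.find("a") for row in data_rows]
print(direct_links)  # [None, None]

# Going through the cell works, one row at a time.
tipsters = []
for row in data_rows:
    link = row.find("td[@class='tipst']/a")
    if link is None:
        continue  # defensive: skip rows without a tipster link
    tipsters.append((link.text, link.get("href")))

print(tipsters)  # [('Alice', '/t/alice'), ('Bob', '/t/bob')]
```

A single loop over the rows is enough here: each cell is addressed relative to the current row, whether by class or by position (td[2], td[3], and so on), so the differing classes do not call for separate loops.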
