Scrapy spider can't seem to find the XPath for the next page

Question

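My spider scrapes everything I want from the first page, but when it tries to find the XPath for the next page I get an index-out-of-range error. I tested the XPath in the shell and it looks fine, so now I don't know what to do.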

from scrapy.spiders import Spider
from scrapy.selector import Selector
from scrapy.http import HtmlResponse
from lrrytas.items import LrrytasItem
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class LrrytasSpider(Spider):
    name = "lrrytas"
    allowed_domains = ['http://www.lrytas.lt/']
    start_urls = ["http://www.lrytas.lt/?id=14355922181434706286&view=6"]
    rules = (
       Rule(LinkExtractor(allow=r'Items'), callback='parse_item', follow=True),
       Rule(LinkExtractor(restrict_xpaths=('//*[@class="comment-box-head"]/*')), callback='parse_comments_follow_next_page', follow=True)
)
    def parse(self, response):
     sel = Selector(response)
     site = sel.xpath('//*[@class="comment"]/*')
     node = sel.xpath('//*[@class="comments"]/*')

     for i in range(0, len(site), 2):
       item = LrrytasItem()
       item['name'] = node[i].xpath('*/div[contains(@class, "comment-nr")]/text()').extract()[0]
       item['ip'] = node[i].xpath('*/*/div[contains(@class, "comment-ip")]/text()').extract()[0]
       item['time'] = node[i].xpath('*/*/div[contains(@class, "comment-time")]/text()').extract()[0]
       item ['comment'] = site[i + 1].xpath('descendant-or-self::text()').extract()[0]
       yield item

    def parse_comments_follow_next_page(self, response):
        next_page = xpath('//*[contains(text(), "Kitas >>") and contains(@href, "id")]/@href')
        if next_page:
            url = response.urljoin(next_page[0].extract())
            yield Request(url, self.parse)

Edit: I used len() to make the loop more automatic rather than manual.

Answer 1

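Your CrawlSpider rules and the XPath used for the next_page check don't look quite right to me. Note also that the class in the question subclasses Spider rather than CrawlSpider, so its rules attribute is never applied and parse_comments_follow_next_page is never called. I'd therefore suggest using a plain Spider and handling the next-page request manually. I've put together some code to show how to do that: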

import scrapy

class Comment(scrapy.Item):
    name = scrapy.Field()
    ip = scrapy.Field()
    time = scrapy.Field()

class MySpider(scrapy.Spider):

    name = 'lrytas'
    allowed_domains = ['www.lrytas.lt']
    start_urls = ['http://www.lrytas.lt/?id=14355922181434706286&view=6']

    def parse(self, response):

        xpath_comments = '//div[@class="comments"]/div[@class="comment"]'
        sel_comments = response.xpath(xpath_comments)
        for sel in sel_comments:
            item = Comment()
            item['name'] = ' '.join(sel.xpath('.//div[@class="comment-nr"]//text()').extract())
            item['time'] = ' '.join(sel.xpath('.//div[@class="comment-time"]//text()').extract())
            # Other item fields go here ...
            yield item

        # Check whether there is a next page link ...
        xpath_NextPage = './/a[contains(.,"Kitas >>")][1]/@href'  # Take one of the two links
        href_NextPage = response.xpath(xpath_NextPage).extract_first()
        if href_NextPage:
            # If YES: build an absolute URL and request it with the same callback
            url_NextPage = response.urljoin(href_NextPage)
            yield scrapy.Request(url_NextPage, callback=self.parse)
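
As a rough sketch of the next-page step in isolation (Python 3, standard library only, no Scrapy needed): the helper name next_page_url and the sample href values are made up for illustration, and hrefs stands in for the list that response.xpath(...).extract() returns. It shows two things: checking for an empty result before indexing, which is what avoids the index-out-of-range error, and resolving a relative href against the URL of the current page, similar in spirit to response.urljoin.

```python
from urllib.parse import urljoin


def next_page_url(page_url, hrefs):
    """Return the absolute URL of the next page, or None on the last page.

    `hrefs` stands in for the list returned by
    response.xpath(...).extract(); it is empty when no link matched.
    (Hypothetical helper, not part of Scrapy.)
    """
    if not hrefs:
        # Indexing an empty list would raise IndexError, so stop here.
        return None
    # Resolve a relative href against the URL of the current page.
    return urljoin(page_url, hrefs[0])


page = "http://www.lrytas.lt/?id=14355922181434706286&view=6"

# Hypothetical href value, made up for this illustration.
print(next_page_url(page, ["/?id=14355922181434706286&view=6&p=2"]))

# No "Kitas >>" link matched: the helper returns None instead of raising.
print(next_page_url(page, []))
```

If the selector matches nothing on the last page, the spider simply stops yielding requests instead of crashing, which is the behaviour you want for pagination.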

Answer 1
1 2015-07-06 12:31:24
