
Scrapy: Use Feed Exporter to store Items without using return

I want to use Scrapy to retrieve some data from a website that has a table spread across multiple pages. The spider looks like this:

from scrapy import FormRequest, Request, Selector, Spider

# Contact is the project's Item subclass, defined in the project's items.py


class ItsyBitsy(Spider):
    name = "itsybitsy"
    allowed_domains = ["mywebsite.com"]
    start_urls = ["http://mywebsite.com/Default.aspx"]

    def parse(self, response):
        # Performs authentication to get past the login form
        return [FormRequest.from_response(response,
            formdata={'tb_Username': 'admin', 'tb_Password': 'password'},
            callback=self.after_login,
            clickdata={'id': 'b_Login'})]

    def after_login(self, response):
        # Session authenticated. Request the Subscriber List page
        yield Request("http://mywebsite.com/List.aspx",
                      callback=self.listpage)

    def listpage(self, response):
        # Parses the entries on the page, and stores them
        sel = Selector(response)
        entries = sel.xpath("//table[@id='gv_Subsribers']").css("tr")
        items = []
        for entry in entries:
            item = Contact()
            # XPath indices are 1-based, and a leading "/" would restart
            # at the document root, so select the first cell relatively
            item['name'] = entry.xpath('td[1]/text()').extract_first()
            items.append(item)

        # I want to request the next page, but store these results FIRST
        self.getNext10()
        return items

I am stuck on those last lines. I want to request the next page (so that I can extract 10 more rows of data), but I want the data saved first, using the Feed Exporter (configured in settings.py).

How can I tell the Feed Exporter to save the data without calling return items (which would stop me from going on to scrape the next 10 rows)?
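For context, feed export is enabled entirely from settings.py; a minimal configuration might look like the sketch below. FEED_FORMAT and FEED_URI are the classic Scrapy feed-export setting names; the output path is an illustrative assumption, not taken from the question.

```python
# settings.py -- minimal feed export sketch (illustrative path, assumed)
FEED_FORMAT = "json"                     # serialize exported items as JSON
FEED_URI = "file:///tmp/contacts.json"   # where the exporter writes the feed
```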

Answer: use a generator. Scrapy consumes each object a callback yields as soon as it is produced, so every yielded item passes through the configured feed exporter before the follow-up Request is handled:

def listpage(self, response):
    # Parses the entries on the page, and yields them one at a time
    sel = Selector(response)
    entries = sel.xpath("//table[@id='gv_Subsribers']").css("tr")
    for entry in entries:
        item = Contact()
        # XPath is 1-based; relative "td[1]" selects the first cell
        item['name'] = entry.xpath('td[1]/text()').extract_first()
        yield item

    # remember getNext10 has to return a Request with callback=self.listpage
    yield self.getNext10()
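To see why the generator version works, here is a plain-Python sketch (no Scrapy; the dicts are stand-ins for Items and Requests, and the names are hypothetical): each yielded object reaches the consumer immediately, so the engine can hand items to the feed exporter before the follow-up request is even produced.

```python
def listpage():
    # Stand-ins for scraped Items: yielded one at a time, as parsed
    for name in ["alice", "bob"]:
        yield {"name": name}
    # Stand-in for the follow-up Request, yielded only after all items
    yield {"request": "http://mywebsite.com/List.aspx?page=2"}

# The consumer (Scrapy's engine, in the real spider) sees the items
# first, in order, and the "request" last
results = list(listpage())
```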
