
How to append items from scrapy spider to list?

I'm using a basic spider that gets particular information from links on a website. My code looks like this:

import sys
import scrapy  # needed for scrapy.Spider below
from scrapy import Request
import urllib.parse as urlparse
from properties import PropertiesItem, ItemLoader
from scrapy.crawler import CrawlerProcess

class BasicSpider(scrapy.Spider):
    name = "basic"
    allowed_domains = ["web"]
    start_urls = ['www.example.com']
    objectList = []
    def parse(self, response):
        # Get item URLs and yield Requests
        item_selector = response.xpath('//*[@class="example"]//@href')
        for url in item_selector.extract():
            yield Request(urlparse.urljoin(response.url, url), callback=self.parse_item, dont_filter=True)

    def parse_item(self, response):
        L = ItemLoader(item=PropertiesItem(), response=response)
        L.add_xpath('title', '//*[@class="example"]/text()')
        L.add_xpath('adress', '//*[@class="example"]/text()')
        return L.load_item()

process = CrawlerProcess()
process.crawl(BasicSpider)
process.start()

What I want now is to append every class instance "L" to a list called objectList. I've tried to do so by altering the code like this:

    def parse_item(self, response):
        global objectList
        l = ItemLoader(item=PropertiesItem(), response=response)
        l.add_xpath('title', '//*[@class="restaurantSummary-name"]/text()')
        l.add_xpath('adress', '//*[@class="restaurantSummary-address"]/text()')
        item = l.load_item()
        objectList.append([item.title, item.adress])
        return objectList       

But when I run this code I get a message saying:

l = ItemLoader(item=PropertiesItem(), response=response)
NameError: name 'PropertiesItem' is not defined

Q: How do I append every item that the scraper finds to the list objectList?

EDIT:

I want to store the results in a list, because I can then save the results like this:

import pandas as pd
table = pd.DataFrame(objectList)   
writer = pd.ExcelWriter('DataAll.xlsx')
table.to_excel(writer, 'sheet 1')
writer.save()

To save results you should use scrapy's Feed Exports feature, as described in the documentation here.

One of the most frequently required features when implementing scrapers is being able to store the scraped data properly and, quite often, that means generating an “export file” with the scraped data (commonly called “export feed”) to be consumed by other systems.

Scrapy provides this functionality out of the box with the Feed Exports, which allows you to generate a feed with the scraped items, using multiple serialization formats and storage backends.

See the csv section for your case.

Another, more custom, approach would be using scrapy's Item Pipelines. There's an example of a simple json writer here that could easily be modified to output csv or any other format.
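A minimal sketch of such a pipeline, collecting every scraped item into a list and writing them out as csv when the spider closes (the class name ListCollectorPipeline and the output filename are made up for illustration; the field names match the PropertiesItem fields from the question):

import csv

class ListCollectorPipeline(object):
    def open_spider(self, spider):
        # start with a fresh list for this crawl
        self.items = []

    def process_item(self, item, spider):
        # keep a plain-dict copy of every scraped item
        self.items.append(dict(item))
        return item

    def close_spider(self, spider):
        # write everything out once the spider finishes
        with open('test.csv', 'w', newline='') as f:
            writer = csv.DictWriter(f, fieldnames=['title', 'adress'])
            writer.writeheader()
            writer.writerows(self.items)

The pipeline still has to be enabled, e.g. via custom_settings = {'ITEM_PIPELINES': {'myproject.pipelines.ListCollectorPipeline': 300}} on the spider (the module path here is hypothetical, adjust it to your project layout).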

For example, this piece of code would output all items to a test.csv file in the project directory:

import scrapy
class MySpider(scrapy.Spider):
    name = 'feed_exporter_test'
    # this is equivalent to what you would set in settings.py file
    custom_settings = {
        'FEED_FORMAT': 'csv',
        'FEED_URI': 'test.csv'
    }
    start_urls = ['http://stackoverflow.com/questions/tagged/scrapy']

    def parse(self, response):
        titles = response.xpath("//a[@class='question-hyperlink']/text()").extract()
        for i, title in enumerate(titles):
            yield {'index': i, 'title': title}

This example generates a 50-row-long csv file.
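If you nevertheless want the items in a Python list (for the pandas export in your EDIT), you can collect them from the script with scrapy's item_scraped signal instead of a global variable. A sketch, assuming the BasicSpider from the question:

from scrapy import signals
from scrapy.crawler import CrawlerProcess
import pandas as pd

objectList = []

def collect_item(item, response, spider):
    # called once for every item the spider yields
    objectList.append(dict(item))

process = CrawlerProcess()
crawler = process.create_crawler(BasicSpider)
crawler.signals.connect(collect_item, signal=signals.item_scraped)
process.crawl(crawler)
process.start()  # blocks until the crawl finishes

table = pd.DataFrame(objectList)
table.to_excel('DataAll.xlsx', sheet_name='sheet 1')

This avoids the NameError entirely, because the collection happens outside the spider and no shared class attribute or global is needed.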
