How to append items from a Scrapy spider to a list?

I'm using a basic spider that gets particular information from links on a website. My code looks like this:

import sys
import scrapy  # needed for scrapy.Spider below
from scrapy import Request
import urllib.parse as urlparse
from properties import PropertiesItem, ItemLoader
from scrapy.crawler import CrawlerProcess

class BasicSpider(scrapy.Spider):
    name = "basic"
    allowed_domains = ["web"]
    start_urls = ['www.example.com']
    objectList = []
    def parse(self, response):
        # Get item URLs and yield Requests
        item_selector = response.xpath('//*[@class="example"]//@href')
        for url in item_selector.extract():
            yield Request(urlparse.urljoin(response.url, url), callback=self.parse_item, dont_filter=True)

    def parse_item(self, response):
        L = ItemLoader(item=PropertiesItem(), response=response)
        L.add_xpath('title', '//*[@class="example"]/text()')
        L.add_xpath('adress', '//*[@class="example"]/text()')
        return L.load_item()

process = CrawlerProcess()
process.crawl(BasicSpider)
process.start()

What I want now is to append every class instance "L" to a list called objectList. I've tried to do so by altering the code like this:

    def parse_item(self, response):
        global objectList
        l = ItemLoader(item=PropertiesItem(), response=response)
        l.add_xpath('title', '//*[@class="restaurantSummary-name"]/text()')
        l.add_xpath('adress', '//*[@class="restaurantSummary-address"]/text()')
        item = l.load_item()
        objectList.append([item.title, item.adress])
        return objectList       

But when I run this code, I get an error saying:

l = ItemLoader(item=PropertiesItem(), response=response)
NameError: name 'PropertiesItem' is not defined
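For reference, this NameError just means PropertiesItem isn't importable where the spider runs. A minimal properties.py that would satisfy the import at the top of the spider might look like this (a sketch only; the field names are assumed from the add_xpath calls above):

import scrapy
from scrapy.loader import ItemLoader  # re-exported so "from properties import ItemLoader" works

class PropertiesItem(scrapy.Item):
    # field names assumed from the loader calls in the question
    title = scrapy.Field()
    adress = scrapy.Field()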

Q: How do I append every item that the scraper finds to the list objectList?

EDIT:

I want to store the results in a list, because I can then save them like this:

import pandas as pd
table = pd.DataFrame(objectList)   
writer = pd.ExcelWriter('DataAll.xlsx')
table.to_excel(writer, 'sheet 1')
writer.save()

To save results you should use Scrapy's Feed Exports feature, as described in the documentation here:

One of the most frequently required features when implementing scrapers is being able to store the scraped data properly and, quite often, that means generating an “export file” with the scraped data (commonly called “export feed”) to be consumed by other systems.

Scrapy provides this functionality out of the box with the Feed Exports, which allows you to generate a feed with the scraped items, using multiple serialization formats and storage backends.

See the CSV section for your case.

Another, more custom approach would be to use Scrapy's Item Pipelines. There's an example of a simple JSON writer here that could easily be modified to output CSV or any other format; a sketch of such a pipeline follows below.
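A minimal sketch of that idea (this is not the code from the linked example, and the myproject.pipelines module path is a placeholder): a pipeline that appends every scraped item as a row in a CSV file.

import csv

class CsvWriterPipeline:
    # Enable with ITEM_PIPELINES = {'myproject.pipelines.CsvWriterPipeline': 300}
    def open_spider(self, spider):
        self.file = open('items.csv', 'w', newline='')
        self.writer = None

    def process_item(self, item, spider):
        row = dict(item)
        if self.writer is None:
            # derive the CSV header from the first item's fields
            self.writer = csv.DictWriter(self.file, fieldnames=list(row))
            self.writer.writeheader()
        self.writer.writerow(row)
        return item  # pass the item on to any later pipelines

    def close_spider(self, spider):
        self.file.close()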

For example, this piece of code would output all items to a test.csv file in the project directory:

import scrapy
class MySpider(scrapy.Spider):
    name = 'feed_exporter_test'
    # this is equivalent to what you would set in the settings.py file
    custom_settings = {
        'FEED_FORMAT': 'csv',
        'FEED_URI': 'test.csv'
    }
    start_urls = ['http://stackoverflow.com/questions/tagged/scrapy']

    def parse(self, response):
        titles = response.xpath("//a[@class='question-hyperlink']/text()").extract()
        for i, title in enumerate(titles):
            yield {'index': i, 'title': title}

This example generates a 50-row CSV file.
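Run it with scrapy runspider and test.csv appears in the working directory. And if you really do need the items in a Python list (e.g. for the pandas export in the question's edit), one option is to collect them with the item_scraped signal instead of a global; a minimal sketch wired up to the question's BasicSpider:

from scrapy import signals
from scrapy.crawler import CrawlerProcess

objectList = []

def collect_item(item, response, spider):
    # called once for every item the spider yields
    objectList.append(dict(item))

process = CrawlerProcess()
crawler = process.create_crawler(BasicSpider)
crawler.signals.connect(collect_item, signal=signals.item_scraped)
process.crawl(crawler)
process.start()  # blocks until the crawl finishes; objectList is now populated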
