
Scrapy item saves only the last element in a loop

I'm using the Scrapy library to crawl data from a website.

I want to save the crawled results to a database, using a Scrapy item and pipeline.

Since the crawl produces a list, I need a for loop to save each item. The problem is that only the last item in the list gets saved.

My code is as follows:

def parse(self, response):
    vehicles = []
    total_results = response.css('.cl-filters-summary-counter::text').extract_first().replace('.', '')

    reference_urls = []
    for url in response.css('.cldt-summary-titles'):
        reference_url = url.css("a::attr(href)").extract_first().strip(' \t\n\r')
        reference_urls.append(reference_url)

    ids = []
    for item in response.css('.cldt-summary-full-item'):
        car_id = item.css("::attr(id)").extract_first().strip(' \t\n\rli-')
        ids.append(car_id)

    prices = []
    for item in response.css('.cldt-price'):
        dirty_price = item.css("::text").extract_first().strip(' \t\n\r')
        comma = dirty_price.index(",-")
        price = dirty_price[2:comma].replace('.', '')
        prices.append(price)

    for item in zip(ids, reference_urls, prices):
        car = CarItem()
        car['reference'] = item[0]
        car['reference_url'] = item[1]
        car['data'] = ""
        car['price'] = item[2]
        return car

The result I get from crawling is good. If, in the for loop, I instead do the following:

vehicles = []
for item in zip(ids, reference_urls, prices):
     scraped_info = {
         "reference": item[0],
         "reference_url": item[1],
         "price": item[2]
     }
     vehicles.append(scraped_info)

And if I print vehicles, I get the correct result:

[
    {
        "price": "4250",
        "reference": "6784086e-1afb-216d-e053-e250040a033f",
        "reference_url": "some-link-1"
    },
    {
        "price": "4250",
        "reference": "c05595ac-e49e-4b71-a436-868c192ef82c",
        "reference_url": "some-link-2"
    },
    {
        "price": "4900",
        "reference": "444553f2-e8fd-41c9-9244-182668544e2a",
        "reference_url": "some-link-3"
    }
]

UPDATE

CarItem is just a Scrapy item defined in items.py:

class CarItem(scrapy.Item):
    # define the fields for your item here like:
    reference = scrapy.Field()
    reference_url = scrapy.Field()
    data = scrapy.Field()
    price = scrapy.Field()

Any idea what I'm doing wrong?

According to the Scrapy documentation, the parse method,

as well as any other Request callback, must return an iterable of Request and/or dicts or Item objects.

The code example at that link also shows this:

import scrapy
from myproject.items import MyItem

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']

    def start_requests(self):
        yield scrapy.Request('http://www.example.com/1.html', self.parse)
        yield scrapy.Request('http://www.example.com/2.html', self.parse)
        yield scrapy.Request('http://www.example.com/3.html', self.parse)

    def parse(self, response):
        for h3 in response.xpath('//h3').extract():
            yield MyItem(title=h3)

        for url in response.xpath('//a/@href').extract():
            yield scrapy.Request(url, callback=self.parse)

We can see that parse has to use yield to produce all of its results: a return inside the loop exits the function on its first iteration, while yield turns parse into a generator that emits every item.

tl;dr: replace the return in your last line with yield.
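The difference can be seen in a small, Scrapy-free sketch. The lists below are made-up stand-ins for the spider's ids, reference_urls, and prices; the two functions mirror the structure of the final loop in the question's parse method:

```python
# Hypothetical sample data standing in for the scraped lists.
ids = ["id-1", "id-2", "id-3"]
reference_urls = ["some-link-1", "some-link-2", "some-link-3"]
prices = ["4250", "4250", "4900"]

def parse_with_return():
    for ref, url, price in zip(ids, reference_urls, prices):
        # `return` exits the function on the first iteration,
        # so only a single item is ever produced.
        return {"reference": ref, "reference_url": url, "price": price}

def parse_with_yield():
    for ref, url, price in zip(ids, reference_urls, prices):
        # `yield` suspends the function and resumes it on the next
        # iteration, so every item is produced.
        yield {"reference": ref, "reference_url": url, "price": price}

print(parse_with_return())       # one dict
print(list(parse_with_yield()))  # all three dicts
```

Scrapy iterates over whatever the callback returns, which is why a generator built with yield lets the pipeline receive every CarItem instead of just one.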
