Scrapy item saves only the last element in a loop
I'm using the Scrapy library to crawl data from a website. I get the results from crawling the website and I want to save them to a database. I use a Scrapy item and pipeline for that. Since I get a list, I need to use a for loop to save the items. But the problem is that only the last item in the list gets saved.
My code is as follows:
def parse(self, response):
    vehicles = []
    total_results = response.css('.cl-filters-summary-counter::text').extract_first().replace('.', '')
    reference_urls = []
    for url in response.css('.cldt-summary-titles'):
        reference_url = url.css("a::attr(href)").extract_first().strip(' \t\n\r')
        reference_urls.append(reference_url)
    ids = []
    for item in response.css('.cldt-summary-full-item'):
        car_id = item.css("::attr(id)").extract_first().strip(' \t\n\rli-')
        ids.append(car_id)
    prices = []
    for item in response.css('.cldt-price'):
        dirty_price = item.css("::text").extract_first().strip(' \t\n\r')
        comma = dirty_price.index(",-")
        price = dirty_price[2:comma].replace('.', '')
        prices.append(price)
    for item in zip(ids, reference_urls, prices):
        car = CarItem()
        car['reference'] = item[0]
        car['reference_url'] = item[1]
        car['data'] = ""
        car['price'] = item[2]
        return car
The result that I get from crawling is good. If in the for loop I instead do something like the following:
vehicles = []
for item in zip(ids, reference_urls, prices):
    scraped_info = {
        "reference": item[0],
        "reference_url": item[1],
        "price": item[2]
    }
    vehicles.append(scraped_info)
and then print vehicles, I get the right result:
[
    {
        "price": "4250",
        "reference": "6784086e-1afb-216d-e053-e250040a033f",
        "reference_url": "some-link-1"
    },
    {
        "price": "4250",
        "reference": "c05595ac-e49e-4b71-a436-868c192ef82c",
        "reference_url": "some-link-2"
    },
    {
        "price": "4900",
        "reference": "444553f2-e8fd-41c9-9244-182668544e2a",
        "reference_url": "some-link-3"
    }
]
UPDATE
CarItem is just a Scrapy item defined in items.py:
class CarItem(scrapy.Item):
    # define the fields for your item here like:
    reference = scrapy.Field()
    reference_url = scrapy.Field()
    data = scrapy.Field()
    price = scrapy.Field()
Any idea what I'm doing wrong?
According to the Scrapy documentation, the parse method, as well as any other Request callback, must return an iterable of Request and/or dicts or Item objects.
The code example below that link also shows this:
import scrapy
from myproject.items import MyItem

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']

    def start_requests(self):
        yield scrapy.Request('http://www.example.com/1.html', self.parse)
        yield scrapy.Request('http://www.example.com/2.html', self.parse)
        yield scrapy.Request('http://www.example.com/3.html', self.parse)

    def parse(self, response):
        for h3 in response.xpath('//h3').extract():
            yield MyItem(title=h3)
        for url in response.xpath('//a/@href').extract():
            yield scrapy.Request(url, callback=self.parse)
We can see that we have to use yield to get all the results out of the parse function: yield hands each item back to the Scrapy engine one at a time, while return ends the function as soon as it produces a single value.
tl;dr: replace the return in your last line with yield.
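The difference can be demonstrated outside Scrapy with plain Python; the dicts below are stand-ins for the question's CarItem:

```python
def parse_with_return(ids):
    # return exits the function on the first iteration,
    # so the caller only ever receives one item
    for car_id in ids:
        return {"reference": car_id}

def parse_with_yield(ids):
    # yield turns the function into a generator that hands
    # every item back, one per iteration
    for car_id in ids:
        yield {"reference": car_id}

ids = ["id-1", "id-2", "id-3"]
print(parse_with_return(ids))       # a single dict
print(list(parse_with_yield(ids)))  # all three dicts
```

Scrapy iterates over whatever the callback returns, so the generator version lets every item reach the pipeline.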