Scrapy item saves only the last element in a loop
I'm using the Scrapy library to crawl data from a website. I get the results from crawling the website and I want to save them to a database. I use a Scrapy item and pipeline for that. Since I get a list, I need to use a for loop to save the items. But the problem is that only the last item in the list gets saved.
My code is as follows:
def parse(self, response):
    vehicles = []
    total_results = response.css('.cl-filters-summary-counter::text').extract_first().replace('.', '')
    reference_urls = []
    for url in response.css('.cldt-summary-titles'):
        reference_url = url.css("a::attr(href)").extract_first().strip(' \t\n\r')
        reference_urls.append(reference_url)
    ids = []
    for item in response.css('.cldt-summary-full-item'):
        car_id = item.css("::attr(id)").extract_first().strip(' \t\n\rli-')
        ids.append(car_id)
    prices = []
    for item in response.css('.cldt-price'):
        dirty_price = item.css("::text").extract_first().strip(' \t\n\r')
        comma = dirty_price.index(",-")
        price = dirty_price[2:comma].replace('.', '')
        prices.append(price)
    for item in zip(ids, reference_urls, prices):
        car = CarItem()
        car['reference'] = item[0]
        car['reference_url'] = item[1]
        car['data'] = ""
        car['price'] = item[2]
        return car
The result that I get from crawling is good. If in the for loop I instead do something like the following:
vehicles = []
for item in zip(ids, reference_urls, prices):
    scraped_info = {
        "reference": item[0],
        "reference_url": item[1],
        "price": item[2]
    }
    vehicles.append(scraped_info)
and then print vehicles, I get the right result:
[
    {
        "price": "4250",
        "reference": "6784086e-1afb-216d-e053-e250040a033f",
        "reference_url": "some-link-1"
    },
    {
        "price": "4250",
        "reference": "c05595ac-e49e-4b71-a436-868c192ef82c",
        "reference_url": "some-link-2"
    },
    {
        "price": "4900",
        "reference": "444553f2-e8fd-41c9-9244-182668544e2a",
        "reference_url": "some-link-3"
    }
]
UPDATE
CarItem is just a Scrapy item defined in items.py:
class CarItem(scrapy.Item):
    # define the fields for your item here like:
    reference = scrapy.Field()
    reference_url = scrapy.Field()
    data = scrapy.Field()
    price = scrapy.Field()
Any idea what I'm doing wrong?
According to the Scrapy documentation, the parse method, as well as any other Request callback, must return an iterable of Request and/or dicts or Item objects.
The code example below that link also shows this:
import scrapy
from myproject.items import MyItem

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']

    def start_requests(self):
        yield scrapy.Request('http://www.example.com/1.html', self.parse)
        yield scrapy.Request('http://www.example.com/2.html', self.parse)
        yield scrapy.Request('http://www.example.com/3.html', self.parse)

    def parse(self, response):
        for h3 in response.xpath('//h3').extract():
            yield MyItem(title=h3)
        for url in response.xpath('//a/@href').extract():
            yield scrapy.Request(url, callback=self.parse)
We can see that we have to use yield to get all the results out of the parse function: yield hands each item back to the Scrapy engine one at a time, while return ends the function as soon as it produces a single value.
tl;dr: replace the return in your last line with yield.
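The difference can be demonstrated outside Scrapy with plain Python; the dicts below are stand-ins for the question's CarItem:

```python
def parse_with_return(ids):
    # return exits the function on the first iteration,
    # so the caller only ever receives one item
    for car_id in ids:
        return {"reference": car_id}

def parse_with_yield(ids):
    # yield turns the function into a generator that hands
    # every item back, one per iteration
    for car_id in ids:
        yield {"reference": car_id}

ids = ["id-1", "id-2", "id-3"]
print(parse_with_return(ids))       # a single dict
print(list(parse_with_yield(ids)))  # all three dicts
```

Scrapy iterates over whatever the callback returns, so the generator version lets every item reach the pipeline.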