I started programming in Python. As my first project I want to create a web crawler using Scrapy, a Python module. I have come across a problem that I have been struggling with for two days and can't find a solution for. Any help with it would be appreciated.
I would like to crawl and scrape data about car prices from Allegro, the Polish eBay. The first phase of my project is to download the list of car brands together with their subcategories (I want to go as deep into the subcategories as possible) and the number of offers.
I start my crawling from this page: http://allegro.pl/osobowe-pozostale-4058 where I can click through the categories in the left panel. So far I'm focusing only on the data in that left panel.
As a result I would like to receive a JSON file with this structure:
[
  {
    "name": "BMW",                               # name
    "url": "http://allegro.pl/osobowe-bmw-4032", # link to subcategories
    "count": 12726,                              # number of offers
    "subcategories": [
      {
        "name": "Seria 1",                            # name
        "url": "http://allegro.pl/bmw-seria-1-12435", # link to subcategories
        "count": 832                                  # number of offers
      },
      {another BMW model},
      …
    ]
  },
  {another car brand},
  …
]
Because some brands have no subcategories and some have subcategories of subcategories, the web crawler must be quite flexible. Sometimes it should stop at the main page and sometimes go deeper, stopping at a dead-end subcategory:
BMW -> Seria 1 -> E87 (2004-2013) vs. Acura (only 2 offers and no subcategories)
So far I was able to create a first spider; it looks like this:
Items.py
import scrapy

class Allegro3Item(scrapy.Item):
    name = scrapy.Field()
    count = scrapy.Field()
    url = scrapy.Field()
    subcategory = scrapy.Field()
The spider:
import scrapy
from allegro3.items import Allegro3Item

linki = []

class AlegroSpider(scrapy.Spider):
    name = "AlegroSpider"
    allowed_domains = ["allegro.pl"]
    start_urls = ["http://allegro.pl/samochody-osobowe-4029"]

    def parse(self, response):
        global linki
        if response.url not in linki:
            linki.append(response.url)
            for de in response.xpath('//*[@id="sidebar-categories"]/div/nav/ul/li'):
                la = Allegro3Item()
                link = de.xpath('a/@href').extract()
                la['name'] = de.xpath('a/span/span/text()').extract()[0].encode('utf-8')
                la['count'] = de.xpath('span/text()').extract()[0].encode('utf-8')
                la['url'] = response.urljoin(link[0]).encode('utf-8')
                la['subcategory'] = []
                if la['url'] is not None:
                    if la['url'] not in linki:
                        linki.append(la['url'])
                        request = scrapy.Request(la['url'], callback=self.SearchFurther)
                        #la['subcategory'].append(request.meta['la2'])
                yield la

    def SearchFurther(self, response):
        global linki
        for de in response.xpath('//*[@id="sidebar-categories"]/div/nav/ul/li'):
            link = de.xpath('a/@href').extract()
            la2 = Allegro3Item()
            la2['name'] = de.xpath('a/span/span/text()').extract()[0].encode('utf-8')
            la2['count'] = de.xpath('span/text()').extract()[0].encode('utf-8')
            la2['url'] = response.urljoin(link[0]).encode('utf-8')
            yield la2
In this code I'm trying to create an item for each category. My problem is with the additional request 'SearchFurther':
request = scrapy.Request(la['url'],callback=self.SearchFurther)
I don't know how to pass the la2 item, which is the result of SearchFurther, back to the previous request, so that I could append la2 to la['subcategory'] as an additional element of that list (one brand can have many subcategories).
I would be grateful for any help.
Have a look at this documentation: http://doc.scrapy.org/en/latest/topics/request-response.html#passing-additional-data-to-callback-functions
In some cases you may be interested in passing arguments to those callback functions so you can receive the arguments later, in the second callback. You can use the Request.meta attribute for that.