
Recursive web crawling with Scrapy

I have started programming in Python, and as my first project I want to create a web crawler using Scrapy, a Python module. I have come across a problem that I have been struggling with for two days and cannot find a solution to. Any help would be appreciated.

I would like to crawl and scrape data about car prices from Allegro, the Polish eBay. The first phase of my project is to download the list of car brands together with their subcategories (I want to go as deep into the subcategories as possible) and the number of offers.

I start my crawl from this page: http://allegro.pl/osobowe-pozostale-4058, where I can click through the categories in the left panel. So far I am focusing only on the data in that left panel.

As a result I would like to receive a JSON file with this structure:

[
    {
        "name": "BMW",                                # brand name
        "url": "http://allegro.pl/osobowe-bmw-4032",  # link to subcategories
        "count": 12726,                               # number of offers
        "subcategories": [
            {
                "name": "Seria 1",                             # name
                "url": "http://allegro.pl/bmw-seria-1-12435",  # link to subcategories
                "count": 832                                   # number of offers
            },
            {another BMW model},
            …
        ]
    },
    {another car brand},
    …
]

Because some brands have no subcategories and some have subcategories of subcategories, the web crawler must be quite flexible: sometimes it should stop at the main page and sometimes go deeper until it reaches a dead-end subcategory.

BMW -> Seria 1 -> E87 (2004-2013)    vs.  Acura (only 2 offers and no subcategories)

So far I have been able to create a first spider; it looks like this:

items.py:

import scrapy

class Allegro3Item(scrapy.Item):
    name = scrapy.Field()
    count = scrapy.Field()
    url = scrapy.Field()
    subcategory = scrapy.Field()

spider:

import scrapy

from allegro3.items import Allegro3Item

# global list of URLs that have already been visited
linki = []

class AlegroSpider(scrapy.Spider):
    name = "AlegroSpider"
    allowed_domains = ["allegro.pl"]
    start_urls = ["http://allegro.pl/samochody-osobowe-4029"]

    def parse(self, response):
        global linki

        if response.url not in linki:
            linki.append(response.url)

            # every <li> in the left panel is one category
            for de in response.xpath('//*[@id="sidebar-categories"]/div/nav/ul/li'):
                la = Allegro3Item()
                link = de.xpath('a/@href').extract()
                la['name'] = de.xpath('a/span/span/text()').extract()[0].encode('utf-8')
                la['count'] = de.xpath('span/text()').extract()[0].encode('utf-8')
                la['url'] = response.urljoin(link[0]).encode('utf-8')
                la['subcategory'] = []

                if la['url'] is not None:
                    if la['url'] not in linki:
                        linki.append(la['url'])

                        request = scrapy.Request(la['url'], callback=self.SearchFurther)
                        # la['subcategory'].append(request.meta['la2'])
                yield la

    def SearchFurther(self, response):
        global linki

        for de in response.xpath('//*[@id="sidebar-categories"]/div/nav/ul/li'):
            link = de.xpath('a/@href').extract()
            la2 = Allegro3Item()
            la2['name'] = de.xpath('a/span/span/text()').extract()[0].encode('utf-8')
            la2['count'] = de.xpath('span/text()').extract()[0].encode('utf-8')
            la2['url'] = response.urljoin(link[0]).encode('utf-8')

            yield la2

In this code I am trying to create a class/item with:

  1. the name of the brand
  2. the number of offers
  3. the link to the subcategory
  4. a list of subcategory elements with the same data as in points 1-3

I have problems with point 4, when I create the additional request with the SearchFurther callback:

request = scrapy.Request(la['url'], callback=self.SearchFurther)

I don't know how to pass the la2 item, which is the result of SearchFurther, back to the previous request, so that I could append la2 to la['subcategory'] as an additional element of the list (one brand can have many subcategories).

I would be grateful for any help.

Have a look at this documentation: http://doc.scrapy.org/en/latest/topics/request-response.html#passing-additional-data-to-callback-functions

In some cases you may be interested in passing arguments to those callback functions so you can receive the arguments later, in the second callback. You can use the Request.meta attribute for that.
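
Applied to your spider, a minimal sketch could look like this (Python 3 and a current Scrapy release assumed; the callback name parse_subcategory and the meta key parent_item are names I chose, and the XPath expressions are taken from your code):

import scrapy

from allegro3.items import Allegro3Item


class AlegroSpider(scrapy.Spider):
    name = "AlegroSpider"
    allowed_domains = ["allegro.pl"]
    start_urls = ["http://allegro.pl/samochody-osobowe-4029"]

    def parse(self, response):
        for de in response.xpath('//*[@id="sidebar-categories"]/div/nav/ul/li'):
            la = Allegro3Item()
            la['name'] = de.xpath('a/span/span/text()').extract_first()
            la['count'] = de.xpath('span/text()').extract_first()
            la['url'] = response.urljoin(de.xpath('a/@href').extract_first())
            la['subcategory'] = []

            # Attach the parent item to the request via meta, and yield the
            # request: Scrapy only schedules requests that are yielded.
            yield scrapy.Request(la['url'],
                                 callback=self.parse_subcategory,
                                 meta={'parent_item': la})

    def parse_subcategory(self, response):
        # Retrieve the item that was attached to this request in parse().
        la = response.meta['parent_item']

        for de in response.xpath('//*[@id="sidebar-categories"]/div/nav/ul/li'):
            la2 = Allegro3Item()
            la2['name'] = de.xpath('a/span/span/text()').extract_first()
            la2['count'] = de.xpath('span/text()').extract_first()
            la2['url'] = response.urljoin(de.xpath('a/@href').extract_first())
            la['subcategory'].append(la2)

        # Yield the parent only after its subcategories have been collected;
        # for a brand with no subcategories the list simply stays empty.
        yield la

Two side notes: Scrapy deduplicates requests by URL out of the box, so the global linki list should not be necessary; and for arbitrarily deep category trees you would repeat the same pattern inside parse_subcategory, yielding further requests that carry la2 in their meta and yielding an item only once a page has no deeper categories.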
