
Recursive web crawling with Scrapy

I started programming in Python. As my first project I want to create a web crawler using Scrapy, a Python module. I have come across a problem I have been struggling with for two days and can't find a solution to. Any help would be appreciated.

I would like to crawl and scrape data about the prices of cars from Allegro, the Polish eBay. The first phase of my project is to download the list of car brands together with their subcategories (I want to go as deep into the subcategories as possible) and the number of offers.

I start my crawling from this site: http://allegro.pl/osobowe-pozostale-4058 , where I can click through categories on the left panel. So far I'm focusing only on the data in this left panel.

As a result I would like to receive a JSON file with this structure:

[
    {"name": "BMW",                                   # name
     "url": "http://allegro.pl/osobowe-bmw-4032",     # link to subcategories
     "count": 12726,                                  # number of offers
     "subcategories": [
         {"name": "Seria 1",                          # name
          "url": "http://allegro.pl/bmw-seria-1-12435",   # link to subcategories
          "count": 832                                # number of offers
         },
         {another BMW model},
         …
     ]
    },
    {another car brand},
    …
]

Because some brands have no subcategories and some have subcategories of subcategories, the web crawler must be quite flexible. Sometimes it should stop at the main page, and sometimes it should go deeper and stop at a dead-end subcategory.

BMW -> Seria 1 -> E87 (2004-2013)    vs.  Acura (only 2 offers and no subcategories)

So far I have been able to create a first spider. It looks like this:

items.py:

import scrapy


class Allegro3Item(scrapy.Item):
    name = scrapy.Field()          # brand or subcategory name
    count = scrapy.Field()         # number of offers
    url = scrapy.Field()           # link to the (sub)category page
    subcategory = scrapy.Field()   # list of nested subcategory items

spider:

import scrapy

from allegro3.items import Allegro3Item

# global list of URLs that have already been visited
linki = []


class AlegroSpider(scrapy.Spider):
    name = "AlegroSpider"
    allowed_domains = ["allegro.pl"]
    start_urls = ["http://allegro.pl/samochody-osobowe-4029"]

    def parse(self, response):
        global linki

        if response.url not in linki:
            linki.append(response.url)

            # each <li> in the left sidebar panel is one category
            for de in response.xpath('//*[@id="sidebar-categories"]/div/nav/ul/li'):

                la = Allegro3Item()
                link = de.xpath('a/@href').extract()
                la['name'] = de.xpath('a/span/span/text()').extract()[0].encode('utf-8')
                la['count'] = de.xpath('span/text()').extract()[0].encode('utf-8')
                la['url'] = response.urljoin(link[0]).encode('utf-8')
                la['subcategory'] = []

                if la['url'] is not None:
                    if la['url'] not in linki:
                        linki.append(la['url'])

                        # follow the category page to collect its subcategories
                        request = scrapy.Request(la['url'], callback=self.SearchFurther)
                        #la['subcategory'].append(request.meta['la2'])
                yield la

    def SearchFurther(self, response):
        global linki

        for de in response.xpath('//*[@id="sidebar-categories"]/div/nav/ul/li'):

            link = de.xpath('a/@href').extract()
            la2 = Allegro3Item()
            la2['name'] = de.xpath('a/span/span/text()').extract()[0].encode('utf-8')
            la2['count'] = de.xpath('span/text()').extract()[0].encode('utf-8')
            la2['url'] = response.urljoin(link[0]).encode('utf-8')

            yield la2

In this code I'm trying to create a class/item with:

  1. the name of the brand
  2. the number of offers
  3. a link to the subcategory
  4. a list of subcategory elements with the same data as in points 1-3

I have a problem with point 4, when I create the additional request with the SearchFurther callback:

request = scrapy.Request(la['url'],callback=self.SearchFurther) 

I don't know how to pass the la2 item, which is the result of SearchFurther, back to the previous request, so that I could append la2 to la['subcategory'] as an additional element of the list (one brand can have many subcategories).

I would be grateful for any help.

Have a look at this documentation: http://doc.scrapy.org/en/latest/topics/request-response.html#passing-additional-data-to-callback-functions

In some cases you may be interested in passing arguments to those callback functions so you can receive the arguments later, in the second callback. You can use the Request.meta attribute for that.
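Applied to the spider above, here is a minimal sketch of that pattern: parse attaches the partially filled brand item to the request via meta, and SearchFurther fills in the subcategories and yields the completed item. The 'parent' meta key and the dict() conversion are illustrative choices, not part of the original code, and the .encode('utf-8') calls are omitted for brevity:

import scrapy

from allegro3.items import Allegro3Item


class AlegroSpider(scrapy.Spider):
    name = "AlegroSpider"
    allowed_domains = ["allegro.pl"]
    start_urls = ["http://allegro.pl/samochody-osobowe-4029"]

    def parse(self, response):
        for de in response.xpath('//*[@id="sidebar-categories"]/div/nav/ul/li'):
            la = Allegro3Item()
            link = de.xpath('a/@href').extract()
            la['name'] = de.xpath('a/span/span/text()').extract()[0]
            la['count'] = de.xpath('span/text()').extract()[0]
            la['url'] = response.urljoin(link[0])
            la['subcategory'] = []

            # Attach the parent item to the request; the callback
            # receives it back through response.meta.
            request = scrapy.Request(la['url'], callback=self.SearchFurther)
            request.meta['parent'] = la
            yield request

    def SearchFurther(self, response):
        # Recover the brand item created in parse.
        la = response.meta['parent']
        for de in response.xpath('//*[@id="sidebar-categories"]/div/nav/ul/li'):
            la2 = Allegro3Item()
            link = de.xpath('a/@href').extract()
            la2['name'] = de.xpath('a/span/span/text()').extract()[0]
            la2['count'] = de.xpath('span/text()').extract()[0]
            la2['url'] = response.urljoin(link[0])
            # Append each subcategory to the parent's list.
            la['subcategory'].append(dict(la2))
        # Yield the brand only once its subcategories are collected.
        yield la

Note that with this approach the brand item is yielded from SearchFurther rather than from parse, so a brand is emitted only after its subcategory page has been parsed; Scrapy's built-in duplicate request filter also makes the global linki list unnecessary.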
