Scrapy crawl nested urls

Introduction

As I have to go deeper into crawling, I face my next problem: crawling nested pages like https://www.karton.eu/Faltkartons.

My crawler has to start at this page, go to https://www.karton.eu/Einwellige-Kartonagen and visit every product listed in this category.

It should do that with every subcategory of "Faltkartons", for every single product contained in every category.

EDITED

My code now looks like this:

import scrapy
from ..items import KartonageItem

class KartonSpider(scrapy.Spider):
    name = "kartons12"
    allow_domains = ['karton.eu']
    start_urls = [
        'https://www.karton.eu/Faltkartons'
        ]
    custom_settings = {'FEED_EXPORT_FIELDS': ['SKU', 'Title', 'Link', 'Price', 'Delivery_Status', 'Weight', 'QTY', 'Volume'] } 
    
    def parse(self, response):
        url = response.xpath('//div[@class="cat-thumbnails"]')

        for a in url:
            link = a.xpath('a/@href')
            yield response.follow(url=link.get(), callback=self.parse_category_cartons)

    def parse_category_cartons(self, response):
        url2 = response.xpath('//div[@class="cat-thumbnails"]')

        for a in url2:
            link = a.xpath('a/@href')
            yield response.follow(url=link.get(), callback=self.parse_target_page)

    def parse_target_page(self, response):
        card = response.xpath('//div[@class="text-center articelbox"]')

        for a in card:
            items = KartonageItem()
            link = a.xpath('a/@href')
            items ['SKU'] = a.xpath('.//div[@class="delivery-status"]/small/text()').get()
            items ['Title'] = a.xpath('.//h5[@class="title"]/a/text()').get()
            items ['Link'] = a.xpath('.//h5[@class="text-center artikelbox"]/a/@href').extract()
            items ['Price'] = a.xpath('.//strong[@class="price-ger price text-nowrap"]/span/text()').get()
            items ['Delivery_Status'] = a.xpath('.//div[@class="signal_image status-2"]/small/text()').get()
            yield response.follow(url=link.get(),callback=self.parse_item, meta={'items':items})

    def parse_item(self,response):
        table = response.xpath('//div[@class="product-info-inner"]')

        items = KartonageItem()
        items = response.meta['items']
        items['Weight'] = a.xpath('.//span[@class="staffelpreise-small"]/text()').get()
        items['Volume'] = a.xpath('.//td[@class="icon_contenct"][7]/text()').get()
        yield items

In my head it starts at the start_url, then visits https://www.karton.eu/Einwellige-Kartonagen, looks for links and follows them to https://www.karton.eu/einwellig-ab-100-mm. On that page it checks the cards for some information and follows the link to the specific product page to get the last items.

Which part(s) of my method is/are wrong? Should I change my class from "scrapy.Spider" to "CrawlSpider"? Or is this only needed if I want to set some rules?
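For reference, CrawlSpider is only worth switching to if you want rule-based link extraction; a minimal sketch of that variant would look roughly like this (the restrict_xpaths values are assumptions carried over from the XPaths above, not tested against the site):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class KartonCrawlSpider(CrawlSpider):
    name = "kartons_crawl"
    allowed_domains = ['karton.eu']
    start_urls = ['https://www.karton.eu/Faltkartons']

    rules = (
        # Follow category and subcategory thumbnail links (no callback = just follow)
        Rule(LinkExtractor(restrict_xpaths='//div[@class="cat-thumbnails"]')),
        # Send product detail pages to the item callback
        Rule(LinkExtractor(restrict_xpaths='//div[@class="text-center artikelbox"]'),
             callback='parse_item'),
    )

    def parse_item(self, response):
        # Extract the product fields here, as in the plain Spider version
        ...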

It could still be possible that my XPaths for the title, SKU etc. are wrong, but first of all I just want to build my basics, to crawl these nested pages.

My console output:

[screenshot of console output]

Finally I managed to go through all these pages, but somehow my .csv-file is still empty.

According to the comments you provided, the issue starts with you skipping a request in your chain.

Your start_urls will request this page: https://www.karton.eu/Faltkartons. The page will be parsed by the parse method and yield new requests, from https://www.karton.eu/Karton-weiss to https://www.karton.eu/Einwellige-Kartonagen.

Those pages will be parsed by the parse_item method, but they are not the final pages you want. You need to parse the category thumbnails on those pages and yield new requests, like this:

for url in response.xpath('//div[@class="cat-thumbnails"]/div/a/@href'):
    yield scrapy.Request(response.urljoin(url.get()), callback=self.new_parsing_method)
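(Note that response.follow(url.get(), callback=self.new_parsing_method) is equivalent here: response.follow resolves relative URLs against the current page, so the explicit urljoin is optional.)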

For example, when parsing https://www.karton.eu/Zweiwellige-Kartons it will find 9 new links.

Finally you need a parsing method to scrape the items on those pages. Since there is more than one item, I suggest you run them in a for loop. (You need the proper XPath to scrape the data.)

EDIT:

Re-editing: I have now looked at the page structure and saw that my code was based on a wrong assumption. The thing is that some pages have no subcategory level, while others do.

Page structure:

ROOT: www.karton.eu/Faltkartons
 |_ Einwellige Kartons
    |_ Subcategory: Kartons ab 100 mm Länge
      |_ Item List (www.karton.eu/einwellig-ab-100-mm)
        |_ Item Detail (www.karton.eu/113x113x100-mm-einwellige-Kartons)
    ...
    |_ Subcategory: Kartons ab 1000 mm Länge
      |_ ...
 |_ Zweiwellige Kartons #Same as above
 |_ Lange Kartons #Same as above
 |_ quadratische Kartons #There is no subcategory
    |_ Item List (www.karton.eu/quadratische-Kartons)
      |_ Item Detail (www.karton.eu/113x113x100-mm-einwellige-Kartons)
 |_ Kartons Höhenvariabel #There is no subcategory
 |_ Kartons weiß #There is no subcategory

The code below will scrape items from the pages with subcategories, as I think that's what you want. Either way, I left print statements in to show you which pages will be skipped because they have no subcategory page, in case you want to include them later.

import scrapy
from ..items import KartonageItem

class KartonSpider(scrapy.Spider):
    name = "kartons12"
    allowed_domains = ['karton.eu']  # Scrapy expects 'allowed_domains'; 'allow_domains' would be silently ignored
    start_urls = [
        'https://www.karton.eu/Faltkartons'
        ]
    custom_settings = {'FEED_EXPORT_FIELDS': ['SKU', 'Title', 'Link', 'Price', 'Delivery_Status', 'Weight', 'QTY', 'Volume'] } 
    
    def parse(self, response):
        url = response.xpath('//div[@class="cat-thumbnails"]')

        for a in url:
            link = a.xpath('a/@href')
            yield response.follow(url=link.get(), callback=self.parse_category_cartons)

    def parse_category_cartons(self, response):
        url2 = response.xpath('//div[@class="cat-thumbnails"]')

        if not url2:
            print('Empty url2:', response.url)

        for a in url2:
            link = a.xpath('a/@href')
            yield response.follow(url=link.get(), callback=self.parse_target_page)

    def parse_target_page(self, response):
        card = response.xpath('//div[@class="text-center artikelbox"]')

        for a in card:
            items = KartonageItem()
            link = a.xpath('a/@href')
            items['SKU'] = a.xpath('.//div[@class="delivery-status"]/small/text()').get()
            items['Title'] = a.xpath('.//h5[@class="title"]/a/text()').get()
            items['Link'] = a.xpath('.//h5[@class="text-center artikelbox"]/a/@href').extract()
            items['Price'] = a.xpath('.//strong[@class="price-ger price text-nowrap"]/span/text()').get()
            items['Delivery_Status'] = a.xpath('.//div[@class="signal_image status-2"]/small/text()').get()
            yield response.follow(url=link.get(), callback=self.parse_item, meta={'items': items})

    def parse_item(self,response):
        table = response.xpath('//div[@class="product-info-inner"]')

        #items = KartonageItem() # You don't need this here, as the line below overwrites the variable.
        items = response.meta['items']
        items['Weight'] = response.xpath('.//span[@class="staffelpreise-small"]/text()').get()
        items['Volume'] = response.xpath('.//td[@class="icon_contenct"][7]/text()').get()
        yield items
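If you later want to include the pages that have no subcategory level, one option is to replace the print branch in parse_category_cartons with a direct hand-off; this is a sketch, assuming those pages use the same "artikelbox" markup as the regular item lists:

    def parse_category_cartons(self, response):
        subcategories = response.xpath('//div[@class="cat-thumbnails"]')

        if not subcategories:
            # No subcategory level: assume this page is already an item list
            # and reuse the existing card parsing on the same response.
            yield from self.parse_target_page(response)
            return

        for a in subcategories:
            link = a.xpath('a/@href')
            yield response.follow(url=link.get(), callback=self.parse_target_page)

Either way, running the spider with scrapy crawl kartons12 -o kartons.csv writes the collected items to the CSV file.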

Notes

Changed this:

    card = response.xpath('//div[@class="text-center articelbox"]')

to this ("k" instead of "c"):

    card = response.xpath('//div[@class="text-center artikelbox"]')

Commented this out, as the items object in meta is already a KartonageItem. (You can remove the line.)

def parse_item(self,response):
    table = response.xpath('//div[@class="product-info-inner"]')
    #items = KartonageItem()
    items = response.meta['items']

Changed this in the parse_item method:

    items['Weight'] = a.xpath('.//span[@class="staffelpreise-small"]/text()').get()
    items['Volume'] = a.xpath('.//td[@class="icon_contenct"][7]/text()').get()

To this:

    items['Weight'] = response.xpath('.//span[@class="staffelpreise-small"]/text()').get()
    items['Volume'] = response.xpath('.//td[@class="icon_contenct"][7]/text()').get()

As a doesn't exist in that method.
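That NameError is likely also why the .csv-file stayed empty: parse_item raised before ever yielding an item. As a side note, newer Scrapy versions (1.7+) recommend cb_kwargs over meta for passing data between callbacks; the same hand-off would look like this (a sketch only):

        yield response.follow(url=link.get(), callback=self.parse_item,
                              cb_kwargs={'items': items})

    def parse_item(self, response, items):
        # 'items' arrives as a keyword argument instead of via response.meta
        items['Weight'] = response.xpath('//span[@class="staffelpreise-small"]/text()').get()
        items['Volume'] = response.xpath('//td[@class="icon_contenct"][7]/text()').get()
        yield items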
