Scrapy Recursive (CrawlSpider) not crawling all links as expected

So my issue is that I have this CrawlSpider

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class RecursiveSpider(CrawlSpider):
    name = 'recursiveSpider'
    allowed_domains = ['industrialnetworking.com']

    custom_settings = {
        'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
    }
    start_urls = [
        'https://www.industrialnetworking.com/Manufacturers/Hirschmann'
    ]
    rules = (
        Rule(LinkExtractor(restrict_css='div.catCell a::attr(href)'), follow=True),
        Rule(LinkExtractor(allow=r"/Manufacturers/Hirschmann*"), callback='parse_new_item'),
    )

I am trying to hit the product pages of all "Hirschmann" products. I understand that my error is in the second line of the "rules", where I allow anything matching Hirschmann*, but I am unsure how to pass something like a response.css/response.xpath selector as the allow argument.

Ideally I would like the crawler to follow all "div.catCell a::attr(href)" links and recurse through them until it detects "response.css('td.cellDesc h2 a::attr(href)')"; at that point it should send that link to my "parse_new_item". If that selector is not found, it should keep following all the "div.catCell a::attr(href)" links. A rough sketch of what I mean follows the example path below.

Example URL travel path ->
StartURL: https://www.industrialnetworking.com/Manufacturers/Hirschmann
Category: https://www.industrialnetworking.com/Manufacturers/Hirschmann-Rail-Switches
SubCategory: https://www.industrialnetworking.com/Manufacturers/Hirschmann-Switches-Unmanaged
Series: https://www.industrialnetworking.com/Manufacturers/Hirschmann-SPIDER-Family-Rail-Switches
END GOAL ->
Product: https://www.industrialnetworking.com/Manufacturers/Hirschmann-SPIDER-III-Rail-Switches/Hirschmann-SSL20-5TX-Rail-Switch-942-132-001

EDIT - The reason I am targeting the xpath/css path is that the links do not have any obvious URL pattern I could use to match them.
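
Something like this is roughly what I am picturing, using LinkExtractor's restrict_css to limit each rule to those page regions. This is completely untested and I am not sure it is even the right way to express it; I put the product rule first because, as far as I understand, a link is only handled by the first rule that extracts it:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class RecursiveSpider(CrawlSpider):
    name = 'recursiveSpider'
    allowed_domains = ['industrialnetworking.com']
    start_urls = ['https://www.industrialnetworking.com/Manufacturers/Hirschmann']

    rules = (
        # Any link inside a product description cell goes straight to parse_new_item.
        Rule(LinkExtractor(restrict_css='td.cellDesc h2'), callback='parse_new_item'),
        # Any link inside a category cell is just followed deeper.
        Rule(LinkExtractor(restrict_css='div.catCell'), follow=True),
    )

    def parse_new_item(self, response):
        # Placeholder; my real item parsing goes here.
        yield {'url': response.url}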

Thanks everyone!

Your above-mentioned web page contains 14 listing URLs, so you can use either XPath or CSS selectors only, and you have to use follow=False to get rid of unnecessary URLs.

from scrapy.crawler import CrawlerProcess
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class TestSpider(CrawlSpider):
    name = 'test'

    allowed_domains = ['industrialnetworking.com']
    start_urls = ['https://www.industrialnetworking.com/Manufacturers/Hirschmann']

    rules = (
        # Follow every link whose URL contains /Manufacturers/Hirschmann- and
        # run parse_item on each page that is reached.
        Rule(LinkExtractor(allow=r'/Manufacturers/Hirschmann-'), follow=True, callback='parse_item'),
    )

    def parse_item(self, response):
        yield {
            'Title': response.xpath('//*[@id="itmNam"]/h1/text()').get()
        }


if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(TestSpider)
    process.start()
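
If the spider lives inside a regular Scrapy project, you can drop the __main__ block and start it with scrapy crawl test instead; the CrawlerProcess wrapper is only needed when running it as a standalone script.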

I personally am not a big fan of the CrawlSpider. There are some cases where it is convenient, but I think in your situation sticking to crawling the links manually might be an easier approach.

Since you have multiple pages with the same format, what you can do is feed each of the links back into the main parse method until it finds links that match the td/h2/a selector, at which point it assigns a different callback that parses the final product page with your parse_new_item method.

For example:

import scrapy


class MySpider(scrapy.Spider):
    name = 'recursiveSpider'
    allowed_domains = ['industrialnetworking.com']
    start_urls = ['https://www.industrialnetworking.com/Manufacturers/Hirschmann']

    def parse(self, response):
        # Category / sub-category / series links: feed them back into parse.
        for url in response.xpath("//div[@class='catCell']/a/@href").getall():
            yield scrapy.Request(response.urljoin(url), callback=self.parse)
        # Product listing links: hand these off to parse_new_item.
        for url in response.xpath("//td[@class='cellDesc']/h2/a/@href").getall():
            yield scrapy.Request(response.urljoin(url), callback=self.parse_new_item)

    def parse_new_item(self, response):
        print(response)
        item_name = response.xpath("//div[@id='itmNam']/h1/text()").get()
        item = {"name": item_name}
        yield item
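
If you want to run this one outside of a Scrapy project and export the items, the same CrawlerProcess approach shown in the other answer should work here as well. A minimal sketch, where the products.json feed name is just a placeholder:

from scrapy.crawler import CrawlerProcess

if __name__ == "__main__":
    # Write the scraped items to a JSON feed; MySpider is the class defined above.
    process = CrawlerProcess(settings={
        "FEEDS": {"products.json": {"format": "json"}},
    })
    process.crawl(MySpider)
    process.start()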

The output is really long so I just put the final tally below.

OUTPUT

<200 https://www.industrialnetworking.com/Manufacturers/Hirschmann-Greyhound-Switch-Power-Accessories/Hirschmann-Greyhound-1040-Industrial-Power-Supply-GPS1-KSY9HH>
2022-09-14 13:32:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.industrialnetworking.com/Manufacturers/Hirschmann-Greyhound-Switch-Power-Accessories/Hirschmann-Greyhound-1040-Industrial-Power-Supply-GPS1-KSY9HH>
{'name': 'GPS1-KSY9HH Power Supply'}
2022-09-14 13:32:12 [scrapy.core.engine] INFO: Closing spider (finished)
2022-09-14 13:32:12 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 380892,
 'downloader/request_count': 483,
 'downloader/request_method_count/GET': 483,
 'downloader/response_bytes': 9139340,
 'downloader/response_count': 483,
 'downloader/response_status_count/200': 471,
 'downloader/response_status_count/429': 12,
 'elapsed_time_seconds': 22.988552,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2022, 9, 14, 20, 32, 12, 287889),
 'httpcompression/response_bytes': 41356802,
 'httpcompression/response_count': 471,
 'httperror/response_ignored_count': 4,
 'httperror/response_ignored_status_count/429': 4,
 'item_scraped_count': 401,
 'log_count/DEBUG': 889,
 'log_count/ERROR': 4,
 'log_count/INFO': 14,
 'request_depth_max': 5,
 'response_received_count': 475,
 'retry/count': 8,
 'retry/max_reached': 4,
 'retry/reason_count/429 Unknown Status': 8,
 'scheduler/dequeued': 483,
 'scheduler/dequeued/memory': 483,
 'scheduler/enqueued': 483,
 'scheduler/enqueued/memory': 483,
 'start_time': datetime.datetime(2022, 9, 14, 20, 31, 49, 299337)}
2022-09-14 13:32:12 [scrapy.core.engine] INFO: Spider closed (finished)
