import scrapy
from scrapy.spiders.crawl import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class MySpider(CrawlSpider):
    name = 'genericSpider'
    allowed_domains = ['example.com']
    start_urls = [url_1, url_2, url_3]

    rules = [
        Rule(
            LinkExtractor(),
            callback='parse',
            follow=True
        ),
    ]

    def parse(self, response):
        hxs = scrapy.Selector(response)
        links = hxs.xpath('*//a/@href').extract()
        for link in links:
            print(link)
            print()
I'm attempting to crawl a website. To keep the example simple, my code just extracts all links and prints them to the terminal.
This works fine for the URLs in start_urls, but the spider doesn't seem to crawl the extracted URLs.
That is the point of CrawlSpider, correct? Visit a page, collect its links, and visit all of those links until it runs out of them?
I've been stuck for a few days; any help would be great.
The problem is that you named your method parse. As per the documentation, that name should be avoided when using CrawlSpider, because CrawlSpider implements its own parse method to drive the rule-based crawling, and overriding it breaks the link-following logic. Just rename the method to, for example, parse_link (and adjust the callback argument in the Rule accordingly) and it will work.
Also, remember that the allowed_domains attribute must match the domains of the URLs you intend to crawl.
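For illustration, here is a minimal sketch of the spider with the callback renamed, assuming the same rule; the start URL is a placeholder on example.com, and response.xpath is used instead of constructing a Selector by hand (the original Selector approach also works):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class MySpider(CrawlSpider):
    name = 'genericSpider'
    # must cover the domains of the URLs you want the spider to follow
    allowed_domains = ['example.com']
    start_urls = ['https://example.com/']  # placeholder start URL

    rules = [
        Rule(
            LinkExtractor(),
            callback='parse_link',  # renamed so CrawlSpider's own parse() stays intact
            follow=True,
        ),
    ]

    def parse_link(self, response):
        # response.xpath() works directly on the response; no explicit Selector needed
        for link in response.xpath('//a/@href').extract():
            print(link)

With parse_link as the callback, CrawlSpider's built-in parse method stays in charge of applying the Rule, so the extracted links are scheduled and crawled as expected.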