Which parse method does Scrapy use to parse start_urls?

Question

I want Scrapy to scrape some start URLs and then follow the links in those pages according to rules. My spider inherits from CrawlSpider and has both start_urls and rules set, but it doesn't seem to use the parse function I defined to parse the start_urls. Here is the code:

<!-- language: lang-python --> 
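from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ZhihuSpider(CrawlSpider):

    start_urls = ["https://www.zhihu.com/topic/19778317/organize/entire",
        "https://www.zhihu.com/topic/19778287/organize/entire"]

    # The trailing comma keeps `rules` a tuple of Rule objects.
    # request_tagInfoPage is assumed to be defined elsewhere in the spider (not shown).
    rules = (Rule(LinkExtractor(allow=(r'topic/\d+/organize/entire',)),
                  process_request='request_tagInfoPage', callback='parse_tagPage'),)

    # this is the parse_tagPage() scrapy should use to scrape
    def parse_tagPage(self, response):
        print("start scraping!")  # Explicitly print to show that scraping starts
        # do_something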

However, the console shows that Scrapy crawled the start_urls but printed nothing, so I am pretty sure that parse_tagPage() isn't called, even though Scrapy reports that the URLs were crawled: [scrapy] DEBUG: Crawled (200) <GET https://www.zhihu.com/topic/19778317/organize/entire> (referer: http://www.zhihu.com)

Any hints on why this happens and how to make Scrapy use parse_tagPage()?

Answer 1

First, the CrawlSpider class uses its default parse() method to handle ALL requests that don't specify a callback function, which in my case includes the requests made from start_urls. This parse() method only applies the rules to extract links and doesn't run any of my parsing code on the start_urls pages. That's why I can't scrape anything from the start_urls pages.

Second, the LinkExtractor somehow only extracts the first links from the start_urls pages, and unfortunately those first links are the start_urls themselves. Scrapy's internal duplicate filter therefore drops the requests for those pages. That's why the callback function parse_tagPage() is never called.

I am working on fixing the LinkExtractor.

Answer 2

You can override parse_start_url() to parse the responses for start_urls.
