
Which parse method does Scrapy use to parse start_urls?

I want Scrapy to scrape some start URLs and then follow the links in those pages according to rules. My spider inherits from CrawlSpider and has start_urls and rules set, but it doesn't seem to use the parse function I defined to parse the start_urls. Here is the code:

<!-- language: lang-python --> 
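from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class ZhihuSpider(CrawlSpider):

    name = "zhihu"  # placeholder; the original post does not show the spider name

    start_urls = ["https://www.zhihu.com/topic/19778317/organize/entire",
        "https://www.zhihu.com/topic/19778287/organize/entire"]

    # rules must be an iterable of Rule objects, hence the trailing comma.
    # request_tagInfoPage is defined elsewhere in the spider (not shown).
    rules = (Rule(LinkExtractor(allow=r'topic/\d+/organize/entire'),
            process_request='request_tagInfoPage', callback='parse_tagPage'),)

    # this is the parse_tagPage() scrapy should use to scrape
    def parse_tagPage(self, response):
        print("start scraping!") # Explicitly print to show that scraping starts
        # do_something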

However, the console shows that Scrapy crawled the start_urls but printed nothing, so I am pretty sure that parse_tagPage() isn't called, even though Scrapy reports the URLs as crawled: [scrapy] DEBUG: Crawled (200) <GET https://www.zhihu.com/topic/19778317/organize/entire> (referer: http://www.zhihu.com)

Any hints on why this happens and how to make Scrapy use parse_tagPage()?

First, the CrawlSpider class uses its default parse() method to handle ALL requests that don't specify a callback function, which in my case includes the requests made from start_urls. This parse() method only applies the rules to extract links and doesn't parse the start_urls pages at all. That's why I can't scrape anything from the start_urls pages.

Second, the LinkExtractor somehow extracts only the first link from each start_urls page, and unfortunately those first links are the start_urls themselves. Scrapy's built-in duplicate filter therefore drops the requests for those pages. That's why the callback function parse_tagPage() is never called.

I am working on fixing the LinkExtractor.

You can override parse_start_url() to parse the responses for start_urls.


 