
Scrapy CrawlSpider Output while Crawling

I'm trying to learn the Scrapy framework, and I'm able to write a spider and crawl around the web and so forth. I'm also able to save the desired data, but not in the way I would like.

Example Code:

    import scrapy
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class ExampleSpider(CrawlSpider):
        name = 'examplecrawler'
        allowed_domains = ['example.com']
        start_urls = ['https://www.example.com/']
        rules = [
            # Avoid "parse" as the callback name: CrawlSpider uses parse()
            # internally to implement its rule-following logic.
            Rule(LinkExtractor(unique=True), follow=True, callback='parse_item')
        ]

        def parse_item(self, response):
            # Yield each visited URL as an item.
            yield {'link': response.url}
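
For context, a spider like this is typically run through Scrapy's built-in feed export, e.g. (assuming the command-line invocation the question implies; the output file name is illustrative):

    scrapy crawl examplecrawler -o output.json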

Current Result: the spider runs recursively, but the item exporters only write output when I stop it with Ctrl+C.

Desired Result: the spider runs recursively and writes output while running, so I don't have to stop it to get the output.

I have read through the documentation and can see that I could possibly write a custom pipeline to save the data, but I was wondering whether this is possible with the current item exporters, i.e. CSV and JSON.

To change how your current crawler reports its status in real time, you would have to modify the existing code of the base class or write a crawler yourself. Since you are importing an existing module, you have no real way of changing how it works, so your best (if not only) bet is to create your own crawler with customized output.
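As a minimal sketch of the custom-pipeline route the question itself mentions (the module path, class name, and output.jl file name below are my own illustrative choices, not anything Scrapy prescribes): the pipeline writes each item as one JSON line and flushes after every write, so the file stays readable while the spider is still running.

    # pipelines.py (hypothetical module; names are illustrative)
    import json

    class StreamingJsonLinesPipeline:
        def open_spider(self, spider):
            # Open the output file once, when the spider starts.
            self.file = open('output.jl', 'w', encoding='utf-8')

        def close_spider(self, spider):
            self.file.close()

        def process_item(self, item, spider):
            # Write one JSON object per line and flush immediately,
            # so the output is visible without stopping the spider.
            self.file.write(json.dumps(dict(item)) + '\n')
            self.file.flush()
            return item

Enable it in settings.py:

    # settings.py
    ITEM_PIPELINES = {
        'myproject.pipelines.StreamingJsonLinesPipeline': 300,
    }

JSON Lines is the natural format here: a single JSON array cannot be finalized until the crawl ends, whereas line-per-item output can be flushed continuously as items are scraped.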
