Scrapy execution flow

I am trying to understand how Scrapy executes a spider, but the generators involved confuse me. I have a basic idea of how generators work, but I cannot visualize how they fit together here.

Below is the code from the Scrapy documentation.

Questions:

1) How does yield work here?

2) I see two for loops in the parse function. The first loop yields requests whose callback is parse_author, but parse_author is only called after loop 1 (which runs twice) and loop 2 (which runs once) have finished. Can someone please explain how the execution flows here?

import scrapy
from datetime import datetime


# The class line was missing from the post; the class name here is assumed.
class AuthorSpider(scrapy.Spider):
    name = 'prox-reveal'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # follow links to author pages
        for href in response.css('.author + a::attr(href)'):
            print('1---------->{}'.format(datetime.now().strftime('%Y%m%d_%H%M%S-%f')))
            yield response.follow(href, self.parse_author)

        # follow pagination links
        for href in response.css('li.next a::attr(href)'):
            print('2---------->{}'.format(datetime.now().strftime('%Y%m%d_%H%M%S-%f')))
            yield response.follow(href, self.parse)

    def parse_author(self, response):
        print('3---------->{}'.format(datetime.now().strftime('%Y%m%d_%H%M%S-%f')))

        def extract_with_css(query):
            # return the first match for the selector, without surrounding whitespace
            return response.css(query).extract_first().strip()

        yield {
            'name': extract_with_css('h3.author-title::text'),
            'birthdate': extract_with_css('.author-born-date::text'),
            'bio': extract_with_css('.author-description::text'),
        }

Thanks.

A simplified overview of the relation between a request and its callback:

  • A Request object is created and passed to Scrapy's engine for further processing:

     yield response.follow(href, self.parse_author)

  • The requested web page is downloaded and a Response object is created.

  • The request's callback (parse_author()) is called with the created response.
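
The three steps above can be imitated with a small toy model. This is only a sketch: it uses no Scrapy code, the Request, Response, and run names are stand-ins invented for illustration, and the real engine is asynchronous rather than a simple queue loop. It does show why every "1" and "2" line is printed before the first "3": the engine drains the parse generator and merely queues each yielded request, and callbacks run later.

```python
from collections import deque


class Request:
    """Hypothetical stand-in for a Scrapy request: a URL plus a callback."""
    def __init__(self, url, callback):
        self.url = url
        self.callback = callback


class Response:
    """Hypothetical stand-in for a downloaded page."""
    def __init__(self, url):
        self.url = url


log = []


def parse(response):
    # Loop 1: one request per author link (two fake links per page).
    for author in ("a", "b"):
        log.append("1")
        yield Request(response.url + "author/" + author, parse_author)
    # Loop 2: only the first page has a "next" link in this toy model.
    if response.url == "/":
        log.append("2")
        yield Request("/page/2/", parse)


def parse_author(response):
    log.append("3")
    yield {"name": response.url}


def run(start_url, callback):
    """Toy engine: queue yielded requests, collect yielded items."""
    pending = deque([Request(start_url, callback)])
    items = []
    while pending:
        request = pending.popleft()
        response = Response(request.url)  # stands in for the download
        for result in request.callback(response):
            if isinstance(result, Request):
                pending.append(result)  # scheduled, callback not run yet
            else:
                items.append(result)
    return items


items = run("/", parse)
print(" ".join(log))  # 1 1 2 3 3 1 1 3 3
```

Note that yield never calls parse_author directly; it only hands a Request to whoever is iterating the generator, and that caller decides when the callback runs.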

Now comes the part I believe is causing you trouble.

Scrapy is an asynchronous framework: it can do other things while waiting for I/O operations (such as downloading a web page) to complete.

So your loop continues, further requests are created and scheduled, and each callback is called once the response for its request is available. The yield statement does not call parse_author itself; it only hands the Request to the engine.
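
The effect of waiting on I/O can be sketched with the standard library's asyncio. Scrapy itself is built on Twisted, so this is an analogy under that assumption, not Scrapy's implementation; download, fetch_and_callback, and the delays are made up for illustration. Both requests are created before either callback runs, and the callbacks fire in the order the simulated downloads finish.

```python
import asyncio

events = []


async def download(url, delay):
    # Simulated network I/O; the event loop is free to do other work meanwhile.
    await asyncio.sleep(delay)
    return "response for " + url


async def fetch_and_callback(url, delay, callback):
    response = await download(url, delay)
    callback(response)  # called only once the data is available


def parse_author(response):
    events.append("callback: " + response)


async def crawl():
    tasks = []
    for url, delay in (("/author/a", 0.05), ("/author/b", 0.01)):
        events.append("request created: " + url)
        tasks.append(asyncio.create_task(fetch_and_callback(url, delay, parse_author)))
    await asyncio.gather(*tasks)


asyncio.run(crawl())
for event in events:
    print(event)
```

Here the request for /author/a is created first, but its callback runs last because its simulated download takes longer.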
