
Two-direction crawling with a Scrapy spider (Python)

I am reading Learning Scrapy by Dimitrios Kouzis-Loukas, and I have a question about the "Two-direction crawling with a spider" part in Chapter 3, page 58.

The original code is like:

import urlparse

from scrapy import Request

def parse(self, response):
    # Get the next index URLs and yield Requests
    next_selector = response.xpath('//*[contains(@class,"next")]//@href')
    for url in next_selector.extract():
        yield Request(urlparse.urljoin(response.url, url))

    # Get item URLs and yield Requests
    item_selector = response.xpath('//*[@itemprop="url"]/@href')
    for url in item_selector.extract():
        yield Request(urlparse.urljoin(response.url, url),
                      callback=self.parse_item)

But from my understanding, shouldn't the second loop be nested inside the first one, so that we first download an index page, then download all the item pages it links to, and only after that move on to the next index page?

So I just want to know the order in which the original code's requests are processed. Please help!

You can't really merge the two loops.

The Request objects yielded in them have different callbacks.
The first one will be processed by the parse method (which seems to be parsing a listing of multiple items), and the second by the parse_item method (probably parsing the details of a single item).
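For context, here is a minimal sketch of how the two callbacks typically fit together in a complete spider. The class names, start URL, and item fields below are illustrative assumptions, not the book's exact code:

    import urlparse

    from scrapy import Field, Item, Request, Spider

    class PropertyItem(Item):
        # Hypothetical item class; the book defines its own fields
        title = Field()

    class PropertiesSpider(Spider):
        name = 'properties'
        start_urls = ['http://example.com/index']  # placeholder index page

        def parse(self, response):
            # Index page: follow pagination links back into parse ...
            for url in response.xpath('//*[contains(@class,"next")]//@href').extract():
                yield Request(urlparse.urljoin(response.url, url))
            # ... and item links into parse_item
            for url in response.xpath('//*[@itemprop="url"]/@href').extract():
                yield Request(urlparse.urljoin(response.url, url),
                              callback=self.parse_item)

        def parse_item(self, response):
            # Detail page: extract the actual data for one item
            item = PropertyItem()
            item['title'] = response.xpath('//h1/text()').extract()
            yield item

Both loops live in parse because both kinds of links appear on the same index page; nesting one loop inside the other would not change what gets requested, only obscure the structure.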

As for the order of scraping, Scrapy (by default) uses a LIFO queue, which means the most recently created request is processed first (roughly a depth-first order).
However, because Scrapy downloads pages asynchronously and concurrently, it's impossible to say what the exact order will be.
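Note that with the default LIFO behaviour, the item requests (yielded after the pagination request) tend to be scheduled first, which already approximates the "finish this page's items before moving on" order you describe, though concurrency still blurs the exact sequence. If you prefer breadth-first (FIFO) order instead, Scrapy's documented scheduler settings let you switch; a sketch for your project's settings.py:

    # Switch the scheduler from LIFO (depth-first) to FIFO (breadth-first)
    DEPTH_PRIORITY = 1
    SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
    SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'

Even with these settings, concurrent downloads mean you should not rely on any strict per-page ordering.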
