
How do I obtain results from 'yield' in Python?

Perhaps yield in Python is elementary for some, but not for me... at least not yet. I understand that yield creates a 'generator'.
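
For context, here is a minimal, stand-alone illustration of that point (the function name and values are made up, not from the post): a function containing yield returns a generator object, and its body only runs as the object is iterated.

def count_up_to(n):
    # Each yield hands one value back to the caller and pauses here
    # until the next value is requested.
    for i in range(1, n + 1):
        yield i

gen = count_up_to(3)   # nothing has executed yet; gen is a generator object
print(list(gen))       # iterating drives the function body: prints [1, 2, 3]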

I stumbled upon yield when I decided to learn scrapy. I wrote some code for a Spider which works as follows:

  1. Go to the start hyperlink and extract all hyperlinks - which are not full hyperlinks, just sub-directories concatenated onto the starting hyperlink
  2. Examine the hyperlinks and append those meeting specific criteria to the base hyperlink
  3. Use Request to navigate to the new hyperlink and parse it to find a unique id in an element with 'onclick'
import scrapy
from scrapy import Request

class newSpider(scrapy.Spider):
    name = 'new'
    allowed_domains = ['www.alloweddomain.com']
    start_urls = ['https://www.alloweddomain.com']

    def parse(self, response):
        links = response.xpath('//a/@href').extract()
        for link in links:
            if link == 'SpecificCriteria':
                next_link = response.urljoin(link)
                yield Request(next_link, callback=self.parse_new)

EDIT 1:

                for uid_dict in self.parse_new(response):
                    print(uid_dict['uid'])
                    break

End EDIT 1

Running the code here, response evaluates to the HTTP response for start_urls and not for next_link.

    def parse_new(self, response):
        trs = response.xpath("//*[@class='unit-directory-row']").getall()
        for tr in trs:
            if 'SpecificText' in tr:
                elements = tr.split()
                for element in elements:
                    if 'onclick' in element:
                        subelement = element.split('(')[1]
                        uid = subelement.split(')')[0]
                        print(uid)
                        yield {
                            'uid': uid
                        }
                break

It works: scrapy crawls the first page, creates the new hyperlink, and navigates to the next page. parse_new parses the HTML for the uid and 'yields' it. scrapy's output shows that the correct uid is 'yielded'.

What I don't understand is how I can 'use' the uid obtained by parse_new to create and navigate to a new hyperlink, the way I would use a variable, and I cannot seem to return a variable with Request.

I'd check out What does the "yield" keyword do? for a good explanation of how exactly yield works.

In the meantime, spider.parse_new(response) is an iterable object. That is, you can acquire its yielded results via a for loop, e.g.:

for uid_dict in spider.parse_new(response):
    print(uid_dict['uid'])
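
In Scrapy itself, though, you usually don't call parse_new yourself; the engine consumes whatever a callback yields. If the goal is to use the extracted uid to follow another link, one common pattern is to yield a further Request from parse_new. A rough sketch under that assumption - the /units/<uid> path and the parse_unit callback are illustrative placeholders, not part of the original code:

import scrapy

class NewSpider(scrapy.Spider):
    name = 'new'
    allowed_domains = ['www.alloweddomain.com']
    start_urls = ['https://www.alloweddomain.com']

    def parse(self, response):
        for link in response.xpath('//a/@href').extract():
            if link == 'SpecificCriteria':
                yield scrapy.Request(response.urljoin(link), callback=self.parse_new)

    def parse_new(self, response):
        uid = '12345'  # in the real spider this comes from the 'onclick' parsing
        # The detail URL below is a made-up placeholder, not a real endpoint.
        next_url = response.urljoin('/units/' + uid)
        # Yielding another Request hands it back to scrapy's engine, which
        # downloads the page and calls parse_unit with that response.
        yield scrapy.Request(next_url, callback=self.parse_unit)

    def parse_unit(self, response):
        # The response for the uid-derived URL arrives here.
        yield {'url': response.url}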

After much reading and learning, I discovered that the reason scrapy does not perform the callback in the first parse has nothing to do with yield. It comes down to two issues:

1) robots.txt. This can be 'resolved' with ROBOTSTXT_OBEY = False in settings.py.

2) The logger reports Filtered offsite request to .... Passing dont_filter=True to the Request may resolve this.
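
For reference, a sketch of the two fixes (ROBOTSTXT_OBEY and dont_filter are both standard Scrapy options; next_link is the variable from the spider above):

# settings.py
ROBOTSTXT_OBEY = False  # don't fetch/obey robots.txt (fix 1)

# in the spider's parse method (fix 2): mark the request so Scrapy's
# offsite/duplicate filters don't drop it
yield Request(next_link, callback=self.parse_new, dont_filter=True)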
