
How do I obtain results from 'yield' in python?

Perhaps yield in Python is remedial for some, but not for me... at least not yet. I understand yield creates a 'generator'.

I stumbled upon yield when I decided to learn scrapy. I wrote some code for a Spider which works as follows:

  1. Go to the start hyperlink and extract all hyperlinks - these are not full hyperlinks, just sub-directories to be concatenated onto the starting hyperlink
  2. Examine the hyperlinks and append those meeting specific criteria to the base hyperlink
  3. Use Request to navigate to the new hyperlink and parse it to find a unique id in an element with 'onclick'
import scrapy
from scrapy import Request  # Request must be imported for the yield below

class newSpider(scrapy.Spider):
    name = 'new'
    allowed_domains = ['www.alloweddomain.com']
    start_urls = ['https://www.alloweddomain.com']

    def parse(self, response):
        links = response.xpath('//a/@href').extract()
        for link in links:
            if link == 'SpecificCriteria':
                next_link = response.urljoin(link)
                yield Request(next_link, callback=self.parse_new)

EDIT 1:

                for uid_dict in self.parse_new(response):
                    print(uid_dict['uid'])
                    break

End EDIT 1

Running the code here evaluates response as the HTTP response to start_urls and not to next_link .

    def parse_new(self, response):
        trs = response.xpath("//*[@class='unit-directory-row']").getall()
        for tr in trs:
            if 'SpecificText' in tr:
                elements = tr.split()
                for element in elements:
                    if 'onclick' in element:
                        subelement = element.split('(')[1]
                        uid = subelement.split(')')[0]
                        print(uid)
                        yield {
                            'uid': uid
                        }
                break

It works: scrapy crawls the first page, creates the new hyperlink and navigates to the next page. parse_new parses the HTML for the uid and 'yields' it. scrapy's engine shows that the correct uid is 'yielded'.

What I don't understand is how I can 'use' the uid obtained by parse_new to create and navigate to a new hyperlink, the way I would with an ordinary variable; I cannot seem to return a variable with Request .
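In plain-Python terms, what I am after could be sketched as follows. This is scrapy-free and the URL pattern is hypothetical; the stub stands in for parse_new, which yields uid dicts instead of scraping HTML:

```python
def parse_new_stub():
    # stand-in for parse_new: yields uid dicts instead of scraping HTML
    for uid in ('1001', '1002'):
        yield {'uid': uid}

# consume the generator's yields and build a new hyperlink from each uid
base = 'https://www.alloweddomain.com/unit/'  # hypothetical URL pattern
new_links = [base + d['uid'] for d in parse_new_stub()]
```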

I'd check out What does the "yield" keyword do? for a good explanation of how exactly yield works.

In the meantime, spider.parse_new(response) is an iterable object. That is, you can acquire its yielded results via a for loop. E.g.,

for uid_dict in spider.parse_new(response):
    print(uid_dict['uid'])
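To make the mechanics concrete, here is a minimal, scrapy-free sketch: calling a function that contains yield runs none of its body - it returns a generator object, and each iteration (or call to next()) runs the body up to the next yield:

```python
def make_uids():
    # this body does not run until the generator is advanced
    yield {'uid': '42'}
    yield {'uid': '43'}

gen = make_uids()    # returns a generator object; no body code has run yet
first = next(gen)    # runs the body up to the first yield
rest = list(gen)     # exhausts the remaining yields
```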

After much reading and learning I discovered the reason scrapy does not perform the callback in the first parse, and it has nothing to do with yield. It comes down to two issues:

1) robots.txt . This can be 'resolved' with ROBOTSTXT_OBEY = False in settings.py

2) The logger shows Filtered offsite request to . Setting dont_filter=True may resolve this.
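For reference, the first fix is a one-line change in the project's settings.py (a sketch of the relevant fragment only):

```python
# settings.py -- tell scrapy not to honour robots.txt for this project
ROBOTSTXT_OBEY = False
```

For the second issue, the flag is passed when constructing the request, e.g. `yield Request(next_link, callback=self.parse_new, dont_filter=True)`, which tells scrapy's scheduler not to filter the request out.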
