How do I obtain results from 'yield' in python?
Perhaps yield in Python is remedial for some, but not for me... at least not yet. I understand yield creates a 'generator'.
I stumbled upon yield when I decided to learn scrapy. I wrote some code for a Spider which works as follows:
import scrapy
from scrapy import Request

class newSpider(scrapy.Spider):
    name = 'new'
    allowed_domains = ['www.alloweddomain.com']
    start_urls = ['https://www.alloweddomain.com']

    def parse(self, response):
        links = response.xpath('//a/@href').extract()
        for link in links:
            if link == 'SpecificCriteria':
                next_link = response.urljoin(link)
                yield Request(next_link, callback=self.parse_new)
EDIT 1:

for uid_dict in self.parse_new(response):
    print(uid_dict['uid'])
    break

End EDIT 1
Running the code here evaluates response as the HTTP response to start_urls and not to next_link.
def parse_new(self, response):
    trs = response.xpath("//*[@class='unit-directory-row']").getall()
    for tr in trs:
        if 'SpecificText' in tr:
            elements = tr.split()
            for element in elements:
                if 'onclick' in element:
                    subelement = element.split('(')[1]
                    uid = subelement.split(')')[0]
                    print(uid)
                    yield {
                        'uid': uid
                    }
                    break
It works: scrapy crawls the first page, creates the new hyperlink and navigates to the next page. parse_new parses the HTML for the uid and 'yields' it. scrapy's engine shows that the correct uid is 'yielded'.
What I don't understand is how I can 'use' that uid obtained by parse_new to create and navigate to a new hyperlink like I would a variable, and I cannot seem to be able to return a variable with Request.
I'd check out What does the "yield" keyword do? for a good explanation of how exactly yield works.
In the meantime, spider.parse_new(response) is an iterable object. That is, you can acquire its yielded results via a for loop. E.g.,
for uid_dict in spider.parse_new(response):
print(uid_dict['uid'])
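Inside a spider, though, the more idiomatic way to 'use' a yielded uid is to build the next Request in the same generator rather than returning the value to a caller. A minimal sketch of that chaining pattern in plain Python (dicts stand in for Scrapy's Request/Response objects, and the `/unit/` URL scheme is a made-up assumption):

```python
def parse_new(response):
    # For each uid found, immediately build the follow-up link from it.
    # In a real spider this would instead be something like:
    #     yield Request(next_url, callback=self.parse_unit)
    for uid in response['uids']:
        next_url = 'https://www.alloweddomain.com/unit/%s' % uid  # hypothetical URL pattern
        yield {'uid': uid, 'next_request': next_url}

# Simulated response carrying two uids extracted from the page
fake_response = {'uids': ['123', '456']}
results = list(parse_new(fake_response))
print(results[0]['next_request'])  # -> https://www.alloweddomain.com/unit/123
```

The point is that the uid never has to leave the generator: the same callback that extracts it can yield the next Request built from it, and Scrapy's engine will schedule that Request for you.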
After much reading and learning I discovered the reason scrapy does not perform the callback in the first parse, and it has nothing to do with yield. It has a lot to do with two issues:
1) robots.txt. Can be 'resolved' with ROBOTSTXT_OBEY = False in settings.py.
2) The logger shows Filtered offsite request to. Passing dont_filter=True to the Request may resolve this.
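Put together, the two fixes would look roughly like this (a sketch; the Request line assumes the parse() method from the question):

```python
# settings.py -- fix 1: stop Scrapy from obeying robots.txt
ROBOTSTXT_OBEY = False

# In the spider -- fix 2: let the follow-up request bypass the
# offsite filter by passing dont_filter=True:
#     yield Request(next_link, callback=self.parse_new, dont_filter=True)
```

Note that dont_filter=True also disables duplicate-request filtering for that Request, so use it only where the offsite filter is actually the problem.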