Perhaps yield in Python is remedial for some, but not for me... at least not yet. I understand yield creates a 'generator'.
I stumbled upon yield when I decided to learn scrapy. I wrote some code for a Spider which works as follows:
import scrapy
from scrapy import Request


class newSpider(scrapy.Spider):
    name = 'new'
    allowed_domains = ['www.alloweddomain.com']
    start_urls = ['https://www.alloweddomain.com']

    def parse(self, response):
        # Collect every href on the start page.
        links = response.xpath('//a/@href').extract()
        for link in links:
            if link == 'SpecificCriteria':
                # Resolve the relative link and let parse_new handle its response.
                next_link = response.urljoin(link)
                yield Request(next_link, callback=self.parse_new)
EDIT 1:
                for uid_dict in self.parse_new(response):
                    print(uid_dict['uid'])
                    break
End EDIT 1
Running the code here evaluates response as the HTTP response to start_urls and not to next_link.
    def parse_new(self, response):
        # Serialize each matching row to an HTML string.
        trs = response.xpath("//*[@class='unit-directory-row']").getall()
        for tr in trs:
            if 'SpecificText' in tr:
                elements = tr.split()
                for element in elements:
                    if 'onclick' in element:
                        # The uid is whatever sits between the parentheses.
                        subelement = element.split('(')[1]
                        uid = subelement.split(')')[0]
                        print(uid)
                        yield {
                            'uid': uid
                        }
                        break
It works: scrapy crawls the first page, creates the new hyperlink and navigates to the next page. parse_new parses the HTML for the uid and 'yields' it. scrapy's engine shows that the correct uid is 'yielded'.
What I don't understand is how I can 'use' the uid obtained by parse_new to create and navigate to a new hyperlink, like I would with a variable. I cannot seem to return a variable with Request.
I'd check out 'What does the "yield" keyword do?' for a good explanation of how exactly yield works.
In the meantime, spider.parse_new(response) is an iterable object. That is, you can acquire its yielded results via a for loop. E.g.,
for uid_dict in spider.parse_new(response):
    print(uid_dict['uid'])
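Inside a running spider the callbacks are normally not looped over by hand: the engine iterates whatever a callback yields, schedules anything that is a request, and treats the rest as scraped items. The sketch below is a toy model of that loop in plain Python, not Scrapy's actual engine. FakeRequest, PAGES, parse_unit, run and both URLs are hypothetical names invented for illustration. It shows one way parse_new could hand the uid on: by yielding a follow-up request built from it.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class FakeRequest:
    """Hypothetical stand-in for scrapy.Request: a URL plus the callback for its response."""
    url: str
    callback: Callable
    cb_kwargs: dict = field(default_factory=dict)


# Hypothetical canned pages standing in for real HTTP responses.
PAGES = {
    "https://www.alloweddomain.com/units":
        '<tr class="unit-directory-row" onclick="show(1234)">SpecificText</tr>',
    "https://www.alloweddomain.com/unit/1234": "<h1>Unit 1234</h1>",
}


def parse_new(body):
    # Same string-splitting idea as the spider: take the text between the parentheses.
    uid = body.split("(")[1].split(")")[0]
    # Use the uid right away to build the next request instead of only yielding it.
    yield FakeRequest(
        "https://www.alloweddomain.com/unit/" + uid,
        callback=parse_unit,
        cb_kwargs={"uid": uid},
    )


def parse_unit(body, uid):
    # This callback receives the body of ITS OWN request, plus the uid passed along.
    title = body.replace("<h1>", "").replace("</h1>", "")
    yield {"uid": uid, "title": title}


def run(start_request):
    """Toy engine loop: iterate each callback's generator, follow requests, collect items."""
    pending = [start_request]
    items = []
    while pending:
        request = pending.pop(0)
        body = PAGES[request.url]  # pretend download
        for result in request.callback(body, **request.cb_kwargs):
            if isinstance(result, FakeRequest):
                pending.append(result)
            else:
                items.append(result)
    return items


items = run(FakeRequest("https://www.alloweddomain.com/units", parse_new))
print(items)
```

In real Scrapy the counterpart would be yielding scrapy.Request(url, callback=..., cb_kwargs={...}) from parse_new. Calling self.parse_new(response) directly from parse, by contrast, runs it against the response parse already holds, which matches the behaviour described for EDIT 1.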
After much reading and learning, I discovered the reason scrapy does not perform the callback in the first parse. It has nothing to do with yield and a lot to do with two issues:
1) robots.txt. This can be 'resolved' with ROBOTSTXT_OBEY = False in settings.py.
2) The logger has 'Filtered offsite request to'. Passing dont_filter=True to the Request may resolve this.
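To see why a request can count as offsite, here is a simplified stand-in for the host check, modelled on how Scrapy's offsite filtering compares a request's hostname with allowed_domains. It is a sketch rather than Scrapy's own code; is_onsite and the example URLs are hypothetical. The assumption is that the filtered link pointed at a host that is neither www.alloweddomain.com nor one of its subdomains.

```python
import re
from urllib.parse import urlparse


def is_onsite(url, allowed_domains):
    """Simplified model: a host passes if it equals an allowed domain or is a subdomain of one."""
    host = urlparse(url).hostname or ""
    pattern = r"^(.*\.)?(%s)$" % "|".join(re.escape(d) for d in allowed_domains)
    return re.search(pattern, host) is not None


allowed = ["www.alloweddomain.com"]

# Same host as the allowed domain: kept.
print(is_onsite("https://www.alloweddomain.com/units", allowed))

# Bare domain is not a subdomain of www.alloweddomain.com: filtered.
print(is_onsite("https://alloweddomain.com/units", allowed))

# Listing the bare domain instead covers both the bare host and www.
print(is_onsite("https://www.alloweddomain.com/units", ["alloweddomain.com"]))
```

Under that assumption, listing the bare domain in allowed_domains would be an alternative to dont_filter=True, which skips the check for that one request.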