How to reverse list order in Python and stop yield upon return of None?
I am generating pagination links for pages which I suspect exist (Python 3.x):
start_urls = [
    'https://...',
    'https://...'  # list full of URLs
]

def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(
            url=url,
            meta={'handle_httpstatus_list': [301]},
            callback=self.parse,
        )

def parse(self, response):
    for i in range(1, 6):
        url = response.url + '&pn=' + str(i)
        yield scrapy.Request(url, self.parse_item)

def parse_item(self, response):
    # check if no results page
    if response.xpath('//*[@id="searchList"]/div[1]').extract_first():
        self.logger.info('No results found on %s', response.url)
        return None
    ...
Those URLs will be processed by Scrapy in parse_item. Now there are two problems:

1. The order is reversed and I do not understand why. It will request page numbers 5, 4, 3, 2, 1 instead of 1, 2, 3, 4, 5.
2. If no results are found on page 1, the entire series could be stopped. parse_item already returns None, but I guess I need to adapt the parse method to exit the for loop and continue. How?
The scrapy.Request objects you generate run in parallel. In other words, there is no guarantee about the order in which you get the responses, as it depends on the server. (Scrapy's default scheduler also pops pending requests last-in-first-out, which is why the pages tend to be requested as 5, 4, 3, 2, 1 in the first place.)
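If you only need the pages to be fetched in roughly ascending order (strict sequencing needs the chained-callback approach below), one option is to use the priority argument of scrapy.Request, where higher values are dequeued first. A minimal sketch against the asker's parse method; note the responses can still arrive out of order:

def parse(self, response):
    for i in range(1, 6):
        url = response.url + '&pn=' + str(i)
        # priority=-i gives page 1 the highest priority, counteracting
        # the LIFO default; it biases the request order towards 1..5
        # but does not guarantee the order of the responses
        yield scrapy.Request(url, self.parse_item, priority=-i)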
If some requests depend on the response of another request, you should yield those requests from its parse callback.
For example:
def parse(self, response):
    url = response.url + '&pn=' + str(1)
    yield scrapy.Request(url, self.parse_item,
                         cb_kwargs=dict(page=1, base_url=response.url))

def parse_item(self, response, page, base_url):
    # check if no results page; if so, stop the chain by not
    # requesting any further pages
    if response.xpath('//*[@id="searchList"]/div[1]').extract_first():
        self.logger.info('No results found on %s', response.url)
        return
    # results found: queue the next page (pages 1 through 5,
    # matching the original range(1, 6))
    if page < 5:
        url = base_url + '&pn=' + str(page + 1)
        yield scrapy.Request(url, self.parse_item,
                             cb_kwargs=dict(base_url=base_url, page=page + 1))
    # your code to extract items from this page
    yield ...
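As a side note, cb_kwargs requires Scrapy 1.7 or newer; on older versions the same values can be passed through request.meta instead. A minimal sketch of that fallback, using the same assumed names as above:

def parse(self, response):
    url = response.url + '&pn=' + str(1)
    yield scrapy.Request(url, self.parse_item,
                         meta={'page': 1, 'base_url': response.url})

def parse_item(self, response):
    # read the values back from response.meta instead of cb_kwargs
    page = response.meta['page']
    base_url = response.meta['base_url']
    ...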