How to reverse list order in Python and stop yield upon return of None?

With Python 3.x, I am generating pagination links which I suspect may not all exist:

start_urls = [
    'https://...',
    'https://...' # list full of URLs
]

def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(
            url=url,
            meta={'handle_httpstatus_list': [301]},
            callback=self.parse,
        )

def parse(self, response):
    for i in range(1, 6):
        url = response.url + '&pn=' + str(i)
        yield scrapy.Request(url, self.parse_item)

def parse_item(self, response):
    # check if no results page
    if response.xpath('//*[@id="searchList"]/div[1]').extract_first():
        self.logger.info('No results found on %s', response.url)
        return None
    ...

Those URLs will be processed by Scrapy in parse_item. Now there are two problems:

  1. The order is reversed and I do not understand why. It requests page numbers 5, 4, 3, 2, 1 instead of 1, 2, 3, 4, 5.

  2. If no results are found on page 1, the entire series should be stopped. parse_item already returns None, but I guess I need to adapt the parse method to exit the for loop and continue. How?

The scrapy.Request objects you generate run in parallel. In other words, there is no guarantee of the order in which you get the responses, as it depends on the server.
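If your only concern were the scheduling order (problem 1), a minimal sketch using Request's priority argument could bias the scheduler toward lower page numbers, assuming the default scheduler where requests with a higher priority value are dispatched first:

def parse(self, response):
    # Sketch only: page 1 gets priority -1, page 5 gets -5, so the
    # scheduler dispatches lower page numbers first.
    for i in range(1, 6):
        url = response.url + '&pn=' + str(i)
        yield scrapy.Request(url, self.parse_item, priority=-i)

This only affects when requests are dispatched; downloads still run concurrently, so responses can arrive out of order, and it does nothing for problem 2.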

If some requests depend on the response of another request, you should yield those requests in that response's parse callback.

For example:

def parse(self, response):
    # start the chain at page 1; later pages are requested one at a time
    url = response.url + '&pn=' + str(1)
    yield scrapy.Request(url, self.parse_item,
                         cb_kwargs=dict(page=1, base_url=response.url))


def parse_item(self, response, page, base_url):
    # check if this is a "no results" page; if so, stop the chain here
    if response.xpath('//*[@id="searchList"]/div[1]').extract_first():
        self.logger.info('No results found on %s', response.url)
        return
    # your item-extraction code
    yield ...
    # only request the next page after this one returned results
    if page < 6:
        url = base_url + '&pn=' + str(page + 1)
        yield scrapy.Request(url, self.parse_item,
                             cb_kwargs=dict(base_url=base_url, page=page + 1))
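Note that cb_kwargs requires Scrapy 1.7 or newer. On older versions, the same page/base_url state can be carried through request.meta instead; a minimal sketch, assuming the rest of the spider stays as above:

def parse(self, response):
    url = response.url + '&pn=' + str(1)
    yield scrapy.Request(url, self.parse_item,
                         meta={'page': 1, 'base_url': response.url})

def parse_item(self, response):
    # read the chaining state back from the response's meta dict
    page = response.meta['page']
    base_url = response.meta['base_url']
    ...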
