How does paging work in the list_blobs function in Google Cloud Storage Python Client Library
I want to get a list of all the blobs in a Google Cloud Storage bucket using the Client Library for Python. According to the documentation I should use the list_blobs() function. The function appears to take two arguments, max_results and page_token, to achieve paging. I am not sure how to use them. In particular, where do I get the page_token from?

I would have expected that list_blobs() would provide a page_token for use in subsequent calls, but I cannot find any documentation on it.

In addition, max_results is optional. What happens if I don't provide it? Is there a default limit? If so, what is it?
list_blobs() does use paging, but you do not use page_token to achieve it.

The way list_blobs() works is that it returns an iterator that iterates through all the results, doing paging behind the scenes. So simply doing this will get you through all the results, fetching pages as needed:

for blob in bucket.list_blobs():
    print(blob.name)
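To make "paging behind the scenes" concrete, here is a small local simulation of what such an iterator does. The fake_api function and its page size are invented for illustration; this is not the real google-cloud-storage internals, just the general pattern of fetching the next page lazily while the caller sees one flat stream of items.

```python
def fake_api(items, page_token=0, page_size=2):
    """Pretend server call: returns one page plus the token for the next page."""
    page = items[page_token:page_token + page_size]
    next_token = page_token + page_size
    return page, (next_token if next_token < len(items) else None)

def list_all(items):
    """Yield every item, requesting the next page only when needed."""
    token = 0
    while token is not None:
        page, token = fake_api(items, page_token=token)
        yield from page

blobs = ["a.txt", "b.txt", "c.txt", "d.txt", "e.txt"]
print(list(list_all(blobs)))  # → ['a.txt', 'b.txt', 'c.txt', 'd.txt', 'e.txt']
```

The caller never sees tokens at all, which is why the simple for-loop over list_blobs() is usually all you need.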
As of 04/26/2017 this is what the docs say:

page_token (str) – (Optional) Opaque marker for the next “page” of blobs. If not passed, will return the first page of blobs.

This implies that the result will be a single page of results, with page_token determining which page. This is not correct. The result iterator iterates through multiple pages. What page_token actually represents is which page the iterator should START at. If no page_token is provided, it will start at the first page.

max_results limits the total number of results returned by the iterator.
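To pin down those two semantics (page_token picks where iteration starts, max_results caps the total across all pages), here is a local fake-API sketch. Everything here is made up for illustration and is not the real client implementation:

```python
def fake_api(items, page_token=0, page_size=2):
    """Pretend server call: one page of results plus the next token."""
    page = items[page_token:page_token + page_size]
    next_token = page_token + page_size
    return page, (next_token if next_token < len(items) else None)

def list_blobs_sim(items, page_token=0, max_results=None):
    """page_token: where to START; max_results: cap on TOTAL results."""
    token, yielded = page_token, 0
    while token is not None:
        page, token = fake_api(items, page_token=token)
        for item in page:
            if max_results is not None and yielded >= max_results:
                return
            yielded += 1
            yield item

blobs = ["a", "b", "c", "d", "e"]
print(list(list_blobs_sim(blobs, page_token=2)))   # → ['c', 'd', 'e']
print(list(list_blobs_sim(blobs, max_results=3)))  # → ['a', 'b', 'c']
```

Note that neither argument turns the result into a single page: the iterator keeps fetching pages until it runs out of results or hits the cap.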
The iterator does expose pages if you need them:

for page in bucket.list_blobs().pages:
    for blob in page:
        print(blob.name)
I'm just going to leave this here. I'm not sure if the libraries have changed in the 2 years since this answer was posted, but if you're using prefix, then

for blob in bucket.list_blobs()

doesn't work right. It seems like getting blobs and getting prefixes are fundamentally different, and using pages with prefixes is confusing.

I found a post in a github issue (here). This works for me:
def list_gcs_directories(bucket, prefix):
    # from https://github.com/GoogleCloudPlatform/google-cloud-python/issues/920
    iterator = bucket.list_blobs(prefix=prefix, delimiter='/')
    prefixes = set()
    for page in iterator.pages:
        print(page, page.prefixes)
        prefixes.update(page.prefixes)
    return prefixes
A different comment on the same issue suggested this:

def get_prefixes(bucket):
    iterator = bucket.list_blobs(delimiter="/")
    response = iterator._get_next_page_response()
    return response['prefixes']

which only gives you the prefixes if all of your results fit on a single page.
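As background for why listing blobs and listing prefixes behave differently, here is a rough local sketch of what a delimiter does to a flat object namespace. The split_listing function and sample names are invented for illustration; the real grouping happens server-side, but the idea is the same: names containing the delimiter past the prefix are collapsed into pseudo-folder "prefixes" rather than returned as blobs.

```python
def split_listing(names, prefix="", delimiter="/"):
    """Split flat object names into direct blobs and collapsed prefixes."""
    blobs, prefixes = [], set()
    for name in names:
        if not name.startswith(prefix):
            continue
        rest = name[len(prefix):]
        if delimiter in rest:
            # Collapse everything past the first delimiter into one prefix.
            prefixes.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            blobs.append(name)
    return blobs, sorted(prefixes)

names = ["a/x.txt", "a/b/y.txt", "a/b/z.txt", "a/c/w.txt"]
print(split_listing(names, prefix="a/"))
# → (['a/x.txt'], ['a/b/', 'a/c/'])
```

Since prefixes are reported per page of results, you have to accumulate them across iterator.pages (as the helper above does) rather than read them from a single response.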
Please read the inline comments:

from google.cloud import storage

storage = storage.Client()
bucket_name = ''  # Fill in your bucket name here

# This will limit the number of results - replace with None in order to get all the blobs in the bucket
max_results = 23_344

# Please specify the "nextPageToken" in order to trigger implicit pagination
# (which is managed for you by the library).
# Moreover, you'll need to specify "items" with all the fields you would like to fetch.
# Here are the supported fields: https://cloud.google.com/storage/docs/json_api/v1/objects#resource
fields = 'items(name),nextPageToken'

counter = 0
for blob in storage.list_blobs(bucket_name, fields=fields, max_results=max_results):
    counter += 1
    print(counter, ')', blob.name)
I wanted to extract only the subfolders out of a GCP bucket. Some of the other methods do work, but only if all the results fit in a single page. I found something like the following that works for me:

blobs = self.gcp_client.list_blobs(bucket_name, prefix='subfolder/', delimiter='/')
for page in blobs.pages:
    print("Subfolders on Page: ", page.prefixes)

Sharing this answer in case it helps anyone.
It was a bit confusing, but I found the answer here:

https://googlecloudplatform.github.io/google-cloud-python/latest/iterators.html

You can iterate through the pages and access the items you need:

iterator = self.bucket.list_blobs()
self.get_files = []
for page in iterator.pages:
    print('    Page number: %d' % (iterator.page_number,))
    print('  Items in page: %d' % (page.num_items,))
    print('     First item: %r' % (next(page),))
    print('Items remaining: %d' % (page.remaining,))
    print('Next page token: %s' % (iterator.next_page_token,))
    for f in page:
        self.get_files.append("gs://" + f.bucket.name + "/" + f.name)
print("Found %d results" % (len(self.get_files)))