How does paging work in the list_blobs function in Google Cloud Storage Python Client Library
I want to get a list of all the blobs in a Google Cloud Storage bucket using the Client Library for Python. According to the documentation I should use the list_blobs() function. The function appears to take two arguments, max_results and page_token, to achieve paging. I am not sure how to use them. In particular, where do I get the page_token from?

I would have expected that list_blobs() would provide a page_token for use in subsequent calls, but I cannot find any documentation on it.

In addition, max_results is optional. What happens if I don't provide it? Is there a default limit? If so, what is it?
list_blobs() does use paging, but you do not use page_token to achieve it.

The way list_blobs() works is that it returns an iterator that iterates through all the results, doing paging behind the scenes. So simply doing this will get you through all the results, fetching pages as needed:

for blob in bucket.list_blobs():
    print(blob.name)
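To make "paging behind the scenes" concrete, here is a small local simulation of what such an iterator does. The fake_api function and its page size are invented for illustration; this is not the real google-cloud-storage internals, just the general pattern of fetching the next page lazily while the caller sees one flat stream of items.

```python
def fake_api(items, page_token=0, page_size=2):
    """Pretend server call: returns one page plus the token for the next page."""
    page = items[page_token:page_token + page_size]
    next_token = page_token + page_size
    return page, (next_token if next_token < len(items) else None)

def list_all(items):
    """Yield every item, requesting the next page only when needed."""
    token = 0
    while token is not None:
        page, token = fake_api(items, page_token=token)
        yield from page

blobs = ["a.txt", "b.txt", "c.txt", "d.txt", "e.txt"]
print(list(list_all(blobs)))  # → ['a.txt', 'b.txt', 'c.txt', 'd.txt', 'e.txt']
```

The caller never sees tokens at all, which is why the simple for-loop over list_blobs() is usually all you need.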
As of 04/26/2017 this is what the docs say:

page_token (str) – (Optional) Opaque marker for the next “page” of blobs. If not passed, will return the first page of blobs.

This implies that the result will be a single page of results, with page_token determining which page. This is not correct. The result iterator iterates through multiple pages. What page_token actually represents is which page the iterator should START at. If no page_token is provided, it will start at the first page.

max_results limits the total number of results returned by the iterator.
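To pin down those two semantics (page_token picks where iteration starts, max_results caps the total across all pages), here is a local fake-API sketch. Everything here is made up for illustration and is not the real client implementation:

```python
def fake_api(items, page_token=0, page_size=2):
    """Pretend server call: one page of results plus the next token."""
    page = items[page_token:page_token + page_size]
    next_token = page_token + page_size
    return page, (next_token if next_token < len(items) else None)

def list_blobs_sim(items, page_token=0, max_results=None):
    """page_token: where to START; max_results: cap on TOTAL results."""
    token, yielded = page_token, 0
    while token is not None:
        page, token = fake_api(items, page_token=token)
        for item in page:
            if max_results is not None and yielded >= max_results:
                return
            yielded += 1
            yield item

blobs = ["a", "b", "c", "d", "e"]
print(list(list_blobs_sim(blobs, page_token=2)))   # → ['c', 'd', 'e']
print(list(list_blobs_sim(blobs, max_results=3)))  # → ['a', 'b', 'c']
```

Note that neither argument turns the result into a single page: the iterator keeps fetching pages until it runs out of results or hits the cap.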
The iterator does expose pages if you need them:

for page in bucket.list_blobs().pages:
    for blob in page:
        print(blob.name)
I'm just going to leave this here. I'm not sure if the libraries have changed in the 2 years since this answer was posted, but if you're using prefix, then

for blob in bucket.list_blobs()

doesn't work right. It seems like getting blobs and getting prefixes are fundamentally different, and using pages with prefixes is confusing.

I found a post in a github issue (here). This works for me:
def list_gcs_directories(bucket, prefix):
    # from https://github.com/GoogleCloudPlatform/google-cloud-python/issues/920
    iterator = bucket.list_blobs(prefix=prefix, delimiter='/')
    prefixes = set()
    for page in iterator.pages:
        print(page, page.prefixes)
        prefixes.update(page.prefixes)
    return prefixes
A different comment on the same issue suggested this:

def get_prefixes(bucket):
    iterator = bucket.list_blobs(delimiter="/")
    response = iterator._get_next_page_response()
    return response['prefixes']

which only gives you the prefixes if all of your results fit on a single page.
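As background for why listing blobs and listing prefixes behave differently, here is a rough local sketch of what a delimiter does to a flat object namespace. The split_listing function and sample names are invented for illustration; the real grouping happens server-side, but the idea is the same: names containing the delimiter past the prefix are collapsed into pseudo-folder "prefixes" rather than returned as blobs.

```python
def split_listing(names, prefix="", delimiter="/"):
    """Split flat object names into direct blobs and collapsed prefixes."""
    blobs, prefixes = [], set()
    for name in names:
        if not name.startswith(prefix):
            continue
        rest = name[len(prefix):]
        if delimiter in rest:
            # Collapse everything past the first delimiter into one prefix.
            prefixes.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            blobs.append(name)
    return blobs, sorted(prefixes)

names = ["a/x.txt", "a/b/y.txt", "a/b/z.txt", "a/c/w.txt"]
print(split_listing(names, prefix="a/"))
# → (['a/x.txt'], ['a/b/', 'a/c/'])
```

Since prefixes are reported per page of results, you have to accumulate them across iterator.pages (as the helper above does) rather than read them from a single response.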
Please read the inline comments:

from google.cloud import storage

storage = storage.Client()
bucket_name = ''  # Fill in your bucket name here

# This will limit the number of results - replace with None in order to get all the blobs in the bucket
max_results = 23_344

# Please specify the "nextPageToken" in order to trigger implicit pagination
# (which is managed for you by the library).
# Moreover, you'll need to specify "items" with all the fields you would like to fetch.
# Here are the supported fields: https://cloud.google.com/storage/docs/json_api/v1/objects#resource
fields = 'items(name),nextPageToken'

counter = 0
for blob in storage.list_blobs(bucket_name, fields=fields, max_results=max_results):
    counter += 1
    print(counter, ')', blob.name)
I wanted to extract only the subfolders out of a GCP bucket. Some of the other methods do work, but only if all the results fit in a single page. I found something like the following that works for me:

blobs = self.gcp_client.list_blobs(bucket_name, prefix='subfolder/', delimiter='/')
for page in blobs.pages:
    print("Subfolders on Page: ", page.prefixes)

Sharing this answer in case it helps anyone.
It was a bit confusing, but I found the answer here:

https://googlecloudplatform.github.io/google-cloud-python/latest/iterators.html

You can iterate through the pages and access the items you need:

iterator = self.bucket.list_blobs()
self.get_files = []
for page in iterator.pages:
    print('    Page number: %d' % (iterator.page_number,))
    print('  Items in page: %d' % (page.num_items,))
    print('     First item: %r' % (next(page),))
    print('Items remaining: %d' % (page.remaining,))
    print('Next page token: %s' % (iterator.next_page_token,))
    for f in page:
        self.get_files.append("gs://" + f.bucket.name + "/" + f.name)
print("Found %d results" % (len(self.get_files)))