
How does paging work in the list_blobs function in Google Cloud Storage Python Client Library

I want to get a list of all the blobs in a Google Cloud Storage bucket using the Client Library for Python.

According to the documentation, I should use the list_blobs() function. The function appears to use two arguments, max_results and page_token, to achieve paging. I am not sure how to use them.

In particular, where do I get the page_token from?

I would have expected that list_blobs() would provide a page_token for use in subsequent calls, but I cannot find any documentation on it.

In addition, max_results is optional. What happens if I don't provide it? Is there a default limit? If so, what is it?

list_blobs() does use paging, but you do not use page_token to achieve it.

How It Works:

The way list_blobs() works is that it returns an iterator that iterates through all the results, doing paging behind the scenes. So simply doing this will get you through all the results, fetching pages as needed:

for blob in bucket.list_blobs():
    print(blob.name)

The Documentation is Wrong/Misleading:

As of 04/26/2017, this is what the docs say:

page_token (str) – (Optional) Opaque marker for the next “page” of blobs. If not passed, will return the first page of blobs.

This implies that the result will be a single page of results, with page_token determining which page. This is not correct. The result iterator iterates through multiple pages. What page_token actually represents is which page the iterator should START at. If no page_token is provided, it will start at the first page.
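
A rough sketch of resuming from a saved token could look like this (assuming the iterator exposes a next_page_token attribute after it has fetched a page, which the versions I have used do):

# List part of the bucket, then remember where iteration stopped.
iterator = bucket.list_blobs(max_results=50)
for blob in iterator:
    print(blob.name)

# next_page_token is None when there are no further pages.
token = iterator.next_page_token

if token is not None:
    # A later call can start at that page instead of the first one.
    for blob in bucket.list_blobs(page_token=token):
        print(blob.name)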

Helpful To Know:

max_results limits the total number of results returned by the iterator.
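
For example, a minimal sketch (assuming bucket is a bucket object you already have):

# Yields at most 100 blobs in total, even if the bucket holds more.
for blob in bucket.list_blobs(max_results=100):
    print(blob.name)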

The iterator does expose pages if you need them:

for page in bucket.list_blobs().pages:
    for blob in page:
        print(blob.name)

I'm just going to leave this here. I'm not sure if the libraries have changed in the 2 years since this answer was posted, but if you're using a prefix, then for blob in bucket.list_blobs() doesn't work right. It seems like getting blobs and getting prefixes are fundamentally different. And using pages with prefixes is confusing.

I found a post in a GitHub issue (here). This works for me.

def list_gcs_directories(bucket, prefix):
    # from https://github.com/GoogleCloudPlatform/google-cloud-python/issues/920
    iterator = bucket.list_blobs(prefix=prefix, delimiter='/')
    prefixes = set()
    for page in iterator.pages:
        print(page, page.prefixes)
        prefixes.update(page.prefixes)
    return prefixes
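
For example, it could be called like this (the bucket name and prefix below are just placeholders):

from google.cloud import storage

client = storage.Client()
bucket = client.bucket('my-bucket')  # placeholder bucket name

# Returns the immediate "subdirectories" under data/, e.g. {'data/2023/', 'data/2024/'}
subdirectories = list_gcs_directories(bucket, prefix='data/')
print(subdirectories)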

A different comment on the same issue suggested this:

def get_prefixes(bucket):
    iterator = bucket.list_blobs(delimiter="/")
    response = iterator._get_next_page_response()
    return response['prefixes']

This only gives you the prefixes if all of your results fit on a single page.
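
A rough alternative sketch (assuming a library version whose iterator aggregates a prefixes set as its pages are consumed) is to exhaust the iterator first and then read the collected prefixes:

def get_all_prefixes(bucket, prefix=None):
    iterator = bucket.list_blobs(prefix=prefix, delimiter='/')
    # Consume every result so the iterator fetches all pages.
    for _ in iterator:
        pass
    # prefixes is aggregated across all the pages that were fetched.
    return iterator.prefixes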

Please read the inline comments:

from google.cloud import storage

storage_client = storage.Client()

bucket_name = ''  # Fill in your bucket name here

# This will limit the number of results - replace this with None in order to get all the blobs in the bucket
max_results = 23_344

# Please specify the "nextPageToken" in order to trigger an implicit pagination
# (which is managed for you by the library).
# Moreover, you'll need to specify the "items" with all the fields you would like to fetch.
# Here are the supported fields: https://cloud.google.com/storage/docs/json_api/v1/objects#resource

fields = 'items(name),nextPageToken'

counter = 0
for blob in storage_client.list_blobs(bucket_name, fields=fields, max_results=max_results):
    counter += 1
    print(counter, ')', blob.name)

I wanted to extract only the subfolders out of a GCP bucket. Some of the other methods do work, but only if they all fit in a particular page. I found something like the following that works for me:

blobs = self.gcp_client.list_blobs(bucket_name, prefix='subfolder/', delimiter='/')

for page in blobs.pages:
    print("Subfolders on Page: ", page.prefixes)

Sharing this answer in case it helps anyone.

It was a bit confusing, but I found the answer here:

https://googlecloudplatform.github.io/google-cloud-python/latest/iterators.html

You can iterate through the pages and access the items you need:

iterator = self.bucket.list_blobs()

self.get_files = []
for page in iterator.pages:
    print('    Page number: %d' % (iterator.page_number,))
    print('  Items in page: %d' % (page.num_items,))
    # next(page) consumes the first item, so record it before looping over the rest.
    first = next(page)
    print('     First item: %r' % (first,))
    print('Items remaining: %d' % (page.remaining,))
    print('Next page token: %s' % (iterator.next_page_token,))
    self.get_files.append("gs://" + first.bucket.name + "/" + first.name)
    for f in page:
        self.get_files.append("gs://" + f.bucket.name + "/" + f.name)

print("Found %d results" % len(self.get_files))
