How do I list all top-level folders in a given GCS bucket?

Question

I start with:

    from google.cloud import storage

    client = storage.Client()
    bucket = client.get_bucket(BUCKET_NAME)

    # What's next? I need something like client.list_folders(path)

I know how to:

list all the blobs (including blobs in sub-sub-sub-folders, of any depth) with bucket.list_blobs()
or list all the blobs recursively in a given folder with bucket.list_blobs(prefix=<path to subfolder>)

But what if my file system structure has 100 top-level folders, each containing thousands of files? Is there an efficient way to get only those 100 top-level folder names without listing all the blobs inside them?

Answer 1

All the responses here have a piece of the answer, but you need to combine three things: the prefix, the delimiter, and the prefixes member of a loaded list_blobs(...) iterator. Let me put down the code to get the 100 top-level folders, and then we'll walk through it.

import google.cloud.storage as gcs
client = gcs.Client()
blobs = client.list_blobs(
    bucket_or_name=BUCKET_NAME, 
    prefix="", 
    delimiter="/", 
    max_results=1
)
next(blobs, ...) # Force list_blobs to make the api call (lazy loading)
# prefixes is now a set, convert to list
print(list(blobs.prefixes)[:100])

In the first eight lines we build the GCS client and make the client.list_blobs(...) call. In your question you mention the bucket.list_blobs(..) method; as of version 1.43 this still works, but the page on Buckets in the docs says it is now deprecated. The only difference is the keyword arg bucket_or_name, on line 4.

We want folders at the top level, so we don't actually need to specify prefix at all. However, it will be useful for other readers to know that if you want to list the folders inside a top-level directory stuff, you should specify the prefix with a trailing slash. This kwarg would then become prefix="stuff/".

Someone already mentioned the delimiter kwarg, but to reiterate: you should specify it so that GCS knows how to interpret the blob names as directories. Simple enough.
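To make the roles of prefix and delimiter concrete, here is a minimal pure-Python sketch that emulates a delimited listing over a plain list of names. The function name emulate_delimited_listing and the sample names are hypothetical, and this only approximates what the service does; it is not the library's implementation.

```python
def emulate_delimited_listing(blob_names, prefix="", delimiter="/"):
    """Return (direct blobs, folder-like prefixes) for the given prefix."""
    items = []
    prefixes = set()
    for name in blob_names:
        if not name.startswith(prefix):
            continue  # outside the requested "directory"
        rest = name[len(prefix):]
        head, sep, _ = rest.partition(delimiter)
        if sep:
            # The name continues past a delimiter: report only the rolled-up prefix.
            prefixes.add(prefix + head + delimiter)
        else:
            # No further delimiter: the blob sits directly under the prefix.
            items.append(name)
    return items, prefixes


# Hypothetical blob names, for illustration only.
names = [
    "readme.txt",
    "dir0/a.csv",
    "dir0/sub/b.csv",
    "dir1/c.csv",
    "stuff/x/1.bin",
    "stuff/y/2.bin",
    "stuff/z.txt",
]

print(emulate_delimited_listing(names))                   # top level
print(emulate_delimited_listing(names, prefix="stuff/"))  # inside stuff/
```

With the trailing slash in prefix="stuff/", the names under stuff/ are grouped one level deeper, which is why the slash matters.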

The max_results=1 is for efficiency. Remember that we don't want blobs here, only folder names. If we tell GCS to stop looking once it finds a single blob, it might be faster. In practice I have not found this to be the case, but it could easily be if you have vast numbers of blobs, or if the storage is Coldline or whatever. YMMV. Consider it optional. One caveat: the JSON API documents maxResults as a cap on the combined number of items and prefixes in a page, so a small value may also truncate the prefixes you get back. If you see fewer folders than expected, drop this argument.

The blobs object returned is a lazy-loading iterator, which means it won't load anything, not even its members, until the first API call is made. To trigger this first call, we ask for the next element in the iterator. In your case, you know you have at least one file, so simply calling next(blobs) will work. It fetches the blob at the front of the line and then throws it away.

However, if you cannot guarantee that there is at least one blob, then next(blobs), which needs to return something from the iterator, will raise a StopIteration exception. To get around this, we pass the ellipsis ... as the default value.
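The effect of the default argument can be seen with ordinary Python iterators, no GCS involved. In this small sketch the helper name advance is hypothetical; it simply wraps the same next(iterator, ...) call.

```python
def advance(iterator):
    """Pull one element, returning Ellipsis instead of raising when empty."""
    return next(iterator, ...)


print(advance(iter(["blob-a", "blob-b"])))  # first element of a non-empty iterator
print(advance(iter([])) is ...)             # empty iterator: the default comes back

try:
    next(iter([]))  # no default supplied
except StopIteration:
    print("StopIteration raised without a default")
```

Any sentinel would do as the default (None is common); the ellipsis is just a value that is unlikely to be mistaken for a real blob.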

Now that the member of blobs we want, prefixes, is loaded, we print out the first 100. The output will be something like:

['dir0/', 'dir1/', 'dir2/', ...]

Answer 2

I do not think you can get the 100 top-level folders without listing all the blobs inside them. Google Cloud Storage does not have folders or subdirectories; the library just creates the illusion of a hierarchical file tree.

I used this simple code:

from google.cloud import storage

storage_client = storage.Client()
blobs = storage_client.list_blobs('my-project')  # the argument is the bucket name
res = []

for blob in blobs:
    # The first path component of the blob name is its top-level "folder".
    top_level = blob.name.split('/')[0]
    if top_level not in res:
        res.append(top_level)

print(res)
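If you do go the full-listing route, the de-duplication step can be separated from the GCS call and made cheaper than repeated list lookups. Below is a minimal sketch that works on plain name strings; the function name top_level_names and the sample names are hypothetical.

```python
def top_level_names(blob_names, delimiter="/"):
    """Return the distinct first path components, in first-seen order."""
    # dict.fromkeys de-duplicates while preserving insertion order.
    return list(dict.fromkeys(name.split(delimiter, 1)[0] for name in blob_names))


# Hypothetical blob names, for illustration only.
sample = ["dir0/a.csv", "dir0/sub/b.csv", "dir1/c.csv", "root.txt"]
print(top_level_names(sample))
```

Like the loop above, this also reports blobs that sit at the root of the bucket (such as root.txt here), since their first component is the whole name. It still requires every blob name to be listed, so it does not address the efficiency concern in the question.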

Answer 3

You can get the top-level prefixes by using a delimited listing. See the list_blobs documentation:

delimiter (str) – (Optional) Delimiter, used with prefix to emulate hierarchy.

Something like this:

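from google.cloud import storage
storage_client = storage.Client()
blobs = storage_client.list_blobs(BUCKET_NAME, delimiter='/')
list(blobs)  # consume the iterator so that blobs.prefixes gets populated
print(blobs.prefixes)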
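The delimited listing can be wrapped in a small helper. This is a sketch only: the name list_top_level_folders is hypothetical, and the function deliberately takes the client as an argument, assuming it exposes list_blobs(bucket_or_name, prefix=..., delimiter=...) and that the returned iterator carries a prefixes set, as storage.Client does.

```python
def list_top_level_folders(client, bucket_name, prefix=""):
    """Return the sorted folder-like prefixes directly under `prefix`."""
    blobs = client.list_blobs(bucket_name, prefix=prefix, delimiter="/")
    for _ in blobs:
        # Consume the iterator; prefixes is filled in as pages are fetched.
        # Only blobs directly under `prefix` are yielded, not nested ones.
        pass
    return sorted(blobs.prefixes)
```

With the real library this would be called as list_top_level_folders(storage.Client(), BUCKET_NAME), or with prefix="stuff/" to list the folders inside stuff.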

3 answers

Answer 1
5 2021-10-19 09:19:58

Answer 2
3 Accepted 2019-12-30 08:49:12

Answer 3
-1 2019-12-30 17:53:28
