
How to get list of folders in a given bucket using Google Cloud API

I wanted to get all the folders inside a given Google Cloud bucket or folder using the Google Cloud Storage API.

For example, if gs://abc/xyz contains three folders gs://abc/xyz/x1, gs://abc/xyz/x2 and gs://abc/xyz/x3, the API should return all three folders in gs://abc/xyz.

It can easily be done using gsutil:

gsutil ls gs://abc/xyz

But I need to do it using Python and the Google Cloud Storage API.

This question is about listing the folders inside a bucket/folder. None of the suggestions worked for me, and after experimenting with the google.cloud.storage SDK, I suspect it is not possible (as of November 2019) to list the sub-directories of any path in a bucket. It is possible with the REST API, so I wrote this little wrapper...

from google.api_core import page_iterator
from google.cloud import storage

def _item_to_value(iterator, item):
    return item

def list_directories(bucket_name, prefix):
    if prefix and not prefix.endswith('/'):
        prefix += '/'

    extra_params = {
        "projection": "noAcl",
        "prefix": prefix,
        "delimiter": '/'
    }

    gcs = storage.Client()

    path = "/b/" + bucket_name + "/o"

    iterator = page_iterator.HTTPIterator(
        client=gcs,
        api_request=gcs._connection.api_request,
        path=path,
        items_key='prefixes',
        item_to_value=_item_to_value,
        extra_params=extra_params,
    )

    return [x for x in iterator]

For example, if you have my-bucket containing:

  • dog-bark
    • datasets
      • v1
      • v2

Then calling list_directories('my-bucket', 'dog-bark/datasets') will return:

['dog-bark/datasets/v1', 'dog-bark/datasets/v2']

You can use the Python GCS API Client Library. See the Samples and Libraries for Google Cloud Storage documentation page for relevant links to documentation and downloads.

In your case, first I want to point out that you're confusing the term "bucket". I recommend reading the Key Terms page of the documentation. What you're talking about are object name prefixes.

You can start with the list-objects.py sample on GitHub. Looking at the list reference page, you'll want to pass bucket=abc, prefix=xyz/ and delimiter=/.
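For reference, a minimal sketch of that call using the googleapiclient library; the bucket name 'abc' and prefix 'xyz/' stand in for your own values:

import googleapiclient.discovery

service = googleapiclient.discovery.build('storage', 'v1')
req = service.objects().list(bucket='abc', prefix='xyz/', delimiter='/')
resp = req.execute()

# Sub-"folders" come back under the 'prefixes' key, e.g. ['xyz/x1/', 'xyz/x2/', 'xyz/x3/']
print(resp.get('prefixes', []))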

Here's an update to this answer thread:

from google.cloud import storage

# Instantiates a client
storage_client = storage.Client()

# Get GCS bucket
bucket = storage_client.get_bucket(bucket_name)

# Get blobs in bucket (including all subdirectories)
blobs_all = list(bucket.list_blobs())

# Get blobs in a specific subdirectory
blobs_specific = list(bucket.list_blobs(prefix='path/to/subfolder/'))

I also need to simply list the contents of a bucket. Ideally I would like something similar to what tf.gfile provides. tf.gfile has support for determining if an entry is a file or a directory.

I tried the various links provided by @jterrace above, but my results were not optimal. With that said, it's worth showing the results.

Given a bucket which has a mix of "directories" and "files", it's hard to navigate the "filesystem" to find items of interest. I've provided some comments in the code on how the code referenced above works.

In either case, I am using a datalab notebook with credentials included by the notebook. Given the results, I will need to use string parsing to determine which files are in a particular directory. If anyone knows how to expand these methods, or an alternate method to parse the directories similar to tf.gfile, please reply.
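For what it's worth, a minimal sketch of the tf.gfile-style interface mentioned above, assuming a recent TensorFlow where the API lives under tf.io.gfile (the path here is illustrative):

import tensorflow as tf

path = 'gs://bucket-sounds/UrbanSound/data'
for entry in tf.io.gfile.listdir(path):
    full = path.rstrip('/') + '/' + entry
    # isdir() works on gs:// paths and distinguishes "folders" from files
    print(full, 'dir' if tf.io.gfile.isdir(full) else 'file')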

Method One

import sys
import json
import argparse
import googleapiclient.discovery

BUCKET = 'bucket-sounds' 

def create_service():
    return googleapiclient.discovery.build('storage', 'v1')


def list_bucket(bucket):
    """Returns a list of metadata of the objects within the given bucket."""
    service = create_service()

    # Create a request to objects.list to retrieve a list of objects.
    fields_to_return = 'nextPageToken,items(name,size,contentType,metadata(my-key))'
    #req = service.objects().list(bucket=bucket, fields=fields_to_return)  # returns everything
    #req = service.objects().list(bucket=bucket, fields=fields_to_return, prefix='UrbanSound')  # returns everything. UrbanSound is top dir in bucket
    #req = service.objects().list(bucket=bucket, fields=fields_to_return, prefix='UrbanSound/FREE') # returns the file FREESOUNDCREDITS.TXT
    #req = service.objects().list(bucket=bucket, fields=fields_to_return, prefix='UrbanSound/FREESOUNDCREDITS.txt', delimiter='/') # same as above
    #req = service.objects().list(bucket=bucket, fields=fields_to_return, prefix='UrbanSound/data/dog_bark', delimiter='/') # returns nothing
    req = service.objects().list(bucket=bucket, fields=fields_to_return, prefix='UrbanSound/data/dog_bark/', delimiter='/') # returns files in dog_bark dir

    all_objects = []
    # If you have too many items to list in one request, list_next() will
    # automatically handle paging with the pageToken.
    while req:
        resp = req.execute()
        all_objects.extend(resp.get('items', []))
        req = service.objects().list_next(req, resp)
    return all_objects

# usage
print(json.dumps(list_bucket(BUCKET), indent=2))

This generates results like this:

[
  {
    "contentType": "text/csv", 
    "name": "UrbanSound/data/dog_bark/100032.csv", 
    "size": "29"
  }, 
  {
    "contentType": "application/json", 
    "name": "UrbanSound/data/dog_bark/100032.json", 
    "size": "1858"
  },
  ... (remaining items snipped)
]

Method Two

import re
import sys
from google.cloud import storage

BUCKET = 'bucket-sounds'

# Create a Cloud Storage client.
gcs = storage.Client()

# Get the bucket that the file will be uploaded to.
bucket = gcs.get_bucket(BUCKET)

def my_list_bucket(bucket_name, limit=sys.maxsize):
  a_bucket = gcs.lookup_bucket(bucket_name)
  bucket_iterator = a_bucket.list_blobs()
  for resource in bucket_iterator:
    print(resource.name)
    limit = limit - 1
    if limit <= 0:
      break

my_list_bucket(BUCKET, limit=5)

This generates output like this:

UrbanSound/FREESOUNDCREDITS.txt
UrbanSound/UrbanSound_README.txt
UrbanSound/data/air_conditioner/100852.csv
UrbanSound/data/air_conditioner/100852.json
UrbanSound/data/air_conditioner/100852.mp3

To get a list of folders in a bucket, you can use the code snippet below:

import googleapiclient.discovery


def list_sub_directories(bucket_name, prefix):
    """Returns a list of sub-directories within the given bucket."""
    service = googleapiclient.discovery.build('storage', 'v1')

    req = service.objects().list(bucket=bucket_name, prefix=prefix, delimiter='/')
    res = req.execute()
    return res.get('prefixes', [])  # 'prefixes' is absent when there are no sub-directories

# For the example (gs://abc/xyz), bucket_name is 'abc' and the prefix would be 'xyz/'
print(list_sub_directories(bucket_name='abc', prefix='xyz/'))
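Note that this returns only the first page of results. A pagination-aware variant (a sketch using the same library's list_next(), as shown in Method One above) might look like:

import googleapiclient.discovery


def list_sub_directories_paged(bucket_name, prefix):
    """Returns all sub-directory prefixes, following pagination."""
    service = googleapiclient.discovery.build('storage', 'v1')
    req = service.objects().list(bucket=bucket_name, prefix=prefix, delimiter='/')
    prefixes = []
    while req:
        res = req.execute()
        prefixes.extend(res.get('prefixes', []))
        req = service.objects().list_next(req, res)
    return prefixes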

I faced the same issue and managed to get it done by using the standard list_blobs described here:

from google.cloud import storage

storage_client = storage.Client()

bucket_name = "my-bucket"  # placeholders; substitute your own values
prefix = "dir/"
delimiter = "/"

# Note: Client.list_blobs requires at least package version 1.17.0.
blobs = storage_client.list_blobs(
    bucket_name, prefix=prefix, delimiter=delimiter
)

print("Blobs:")
for blob in blobs:
    print(blob.name)

if delimiter:
    print("Prefixes:")
    for prefix in blobs.prefixes:
        print(prefix)

However, this only worked for me after I read AntPhitlok's answer and understood that I have to make sure my prefix ends with /, and that I'm also using / as the delimiter.

Due to that fact, under the 'Blobs:' section we will only get file names (if any exist under the prefix folder), not folders. All the sub-directories will be listed under the 'Prefixes:' section.

It's important to note that blobs is in fact an iterator, so in order to get the sub-directories, we must "open" it. Therefore, leaving out the 'Blobs:' section from our code will result in an empty set() inside blobs.prefixes.
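A minimal sketch of that point: consuming the iterator's pages (rather than printing every blob) is enough to populate blobs.prefixes. Bucket and prefix names here are illustrative:

from google.cloud import storage

client = storage.Client()
blobs = client.list_blobs('my-bucket', prefix='some/dir/', delimiter='/')

prefixes = set()
for page in blobs.pages:  # each page fetch populates page.prefixes
    prefixes.update(page.prefixes)
print(prefixes)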

Edit: An example of usage - say I have a bucket named buck, and inside it a directory named dir. Inside dir I have another directory named subdir.

In order to list the directories inside dir, I can use:

from google.cloud import storage

storage_client = storage.Client()
blobs = storage_client.list_blobs('buck', prefix='dir/', delimiter='/')

print("Blobs:")
for blob in blobs:
    print(blob.name)

print("Prefixes:")
for prefix in blobs.prefixes:
    print(prefix)

*Notice the / at the end of the prefix and as the delimiter.

This call will print the following:

Prefixes:
subdir/

# sudo pip3 install --upgrade google-cloud-storage
import os
from google.cloud import storage

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "./key.json"
storage_client = storage.Client()
bucket = storage_client.get_bucket("my-bucket")
blobs = list(bucket.list_blobs(prefix='dir/'))
print(blobs)
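To reduce that listing to "folders", one hedged follow-up (continuing from the blobs list above) is to collect the unique parent paths of the returned object names:

# names like "dir/sub/file.txt" reduce to parents like "dir/sub/"
folders = sorted({'/'.join(b.name.split('/')[:-1]) + '/' for b in blobs if '/' in b.name})
print(folders)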

# python notebook
ret_folders = !gsutil ls $path_possible_with_regex | grep -e "/$"
ret_folders_no_subdir = [x for x in ret_folders if x.split("/")[-2] != "SUBDIR"]

You can edit the condition to whatever works for you. In my case, I wanted only the deeper "folders". For the same-level folders, you can replace it with:

 x.split("/")[-2] == "SUBDIR"

Here's a simple way to get all subfolders:

from google.cloud import storage


def get_subdirs(bucket_name, dir_name=None):
    """
    List all subdirectories for a bucket or
    a specific folder in a bucket. If `dir_name`
    is left blank, it will list all directories in the bucket.
    """
    client = storage.Client()
    bucket = client.lookup_bucket(bucket_name)

    all_folders = []
    for resource in bucket.list_blobs(prefix=dir_name):

        # filter for directories only (this matches zero-byte "folder"
        # placeholder objects whose names end with a slash)
        n = resource.name
        if n.endswith("/"):
            all_folders.append(n)

    return all_folders

# Use as follows:
all_folders = get_subdirs("my-bucket")

Here is a simple solution:

from google.cloud import storage # !pip install --upgrade google-cloud-storage
import os

# set up your bucket 
client = storage.Client()
# or, with an explicit service-account key file:
# client = storage.Client.from_service_account_json('XXXXXXXX')
bucket = client.get_bucket('XXXXXXXX')

# get all the folder in folder "base_folder"
base_folder = 'model_testing'
blobs=list(bucket.list_blobs(prefix=base_folder))
folders = list(set([os.path.dirname(k.name) for k in blobs]))
print(*folders, sep = '\n')

If you want only the folders in the selected folder:

base_folder = base_folder.rstrip('/')  # remove any trailing slash; GCS object names always use '/'
one_out = list(set([base_folder + '/'.join(k.split(base_folder)[-1].split('/')[:2]) for k in folders]))
print(*one_out, sep = '\n')

Of course, instead of using

list(set())

you could use numpy:

import numpy as np
np.unique()
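For example, assuming the same blobs list from the snippet above:

import os
import numpy as np

# np.unique sorts and de-duplicates the parent directory names in one call
folders = np.unique([os.path.dirname(k.name) for k in blobs])
print(*folders, sep='\n')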

1. Get access to your client object.

Where is the code running?

I am (somewhere) inside the Google Cloud Platform (GCP)

If you are accessing Google Cloud Storage (GCS) from inside GCP, for example Google Kubernetes Engine (GKE), you should use a workload identity to configure your GKE service account to act as a GCS service account: https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity

Once you do this, creating your client is as easy as:

import google.cloud.storage as gcs
client = gcs.Client()

Out in the wild

If you are somewhere else: AWS, Azure, your dev machine, or otherwise outside GCP, then you need to choose between creating a service account key that you download (it's a json file with a cryptographic PRIVATE KEY in it) or using workload identity federation, such as provided by AWS, Azure and "friends".

Let's assume you have decided to download the new GCS service account file to /secure/gcs.json.

PROJECT_NAME = "MY-GCP-PROJECT"
from google.oauth2.service_account import Credentials
import google.cloud.storage as gcs
client = gcs.Client(
    project=PROJECT_NAME,
    credentials=Credentials.from_service_account_file("/secure/gcs.json"),
)

2. Make the list-folders request to GCS

In the OP, we are trying to get the folders inside path xyz in bucket abc. Note that paths in GCS, unlike Linux, do not start with a /; however, they should finish with one. So we will be looking for folders with the prefix xyz/. That is simply folders, not folders and all of their subfolders.

BUCKET_NAME = "abc"
blobs = client.list_blobs(
    BUCKET_NAME,
    prefix="xyz/",  # <- you need the trailing slash
    delimiter="/",
    max_results=1,
)

Note how we have asked for no more than a single blob. This is not a mistake: the blobs are the files themselves, and we're only interested in folders. Setting max_results to zero doesn't work, see below.

3. Force the lazy-loading to...err...load!

Several of the answers up here have looped through every element in the iterator blobs, which could be many millions, but we don't need to do that. That said, if we don't loop through any elements, blobs won't bother making the API request to GCS at all.

next(blobs, ...) # Force blobs to load.
print(blobs.prefixes)

The blobs variable is an iterator with at most one element, but if your folder has no files in it (at its level) then there may be zero elements. If there are zero elements, then next(blobs) will raise a StopIteration.

The second argument, the ellipsis ..., is simply my choice of default return value should there be no next element. I feel this is more readable than, say, None, because it suggests to the reader that something worth noticing is happening here. After all, code that requests a value only to discard it on the same line does have all the hallmarks of a potential bug, so it is good to reassure our reader that this is deliberate.

Finally, suppose we have a tree structure under xyz of aaa, bbb, ccc, and then under ccc we have subsubfolder zzz. The output will then be:

{'xyz/aaa', 'xyz/bbb', 'xyz/ccc'}

Note that, as required in the OP, we do not see subsubfolder xyz/ccc/zzz.
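Putting steps 1 to 3 together as one helper, a sketch following this answer's own technique (names are illustrative):

import google.cloud.storage as gcs


def list_folders(client, bucket_name, prefix):
    """Return the immediate sub-"folders" of prefix, without their contents."""
    if prefix and not prefix.endswith('/'):
        prefix += '/'  # GCS prefixes should end with a slash
    blobs = client.list_blobs(bucket_name, prefix=prefix, delimiter='/', max_results=1)
    next(blobs, ...)  # force the lazy iterator to make the API request
    return blobs.prefixes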

In case anyone does not want to go through the learning curve of the google-cloud-api, you can basically use the subprocess module to run bash commands:

import subprocess
out=subprocess.run(["gsutil","ls","path/to/some/folder/"], capture_output=True)
out_list = out.stdout.decode("utf-8").split("\n")
dir_list = [i for i in out_list if i.endswith("/")]
files_list = [i for i in out_list if i not in dir_list]
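A slightly hardened variant of the same idea, as a sketch (it assumes gsutil is on PATH and Python 3.7+ for capture_output): it checks the exit status and drops the empty string that split("\n") leaves at the end of the list.

import subprocess

out = subprocess.run(["gsutil", "ls", "gs://abc/xyz/"], capture_output=True, check=True)
entries = [i for i in out.stdout.decode("utf-8").split("\n") if i]  # drop empty trailing entry
dir_list = [i for i in entries if i.endswith("/")]
files_list = [i for i in entries if not i.endswith("/")]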

Following yet-another-user's answer (https://stackoverflow.com/users/2350164/yet-another-user), I've created the same function with the "standard" google client instead of the HTTPIterator. Let's say we have a bucket named 'bucket_name' and a subfolder named 'sub_folder_name':

from google.cloud import storage

storage_client = storage.Client(project=PROJECT_NAME)

def get_folders_list(storage_client, bucket_or_name, prefix=''):
    """
    Returns the list of folders within a bucket or its subdirectory.
    :param storage_client: the GCS client
    :param bucket_or_name: the name of the bucket
    :param prefix: the prefix if you want a subdirectory
    :return: list of folders
    """
    if prefix and not prefix.endswith('/'):
        prefix += '/'

    blobs = storage_client.list_blobs(
        bucket_or_name=bucket_or_name,
        prefix=prefix,
        delimiter="/",
        # max_results=1
    )
    next(blobs, ...)  # force the lazy iterator to issue the API request
    return list(blobs.prefixes)

You can use the two examples below for a bucket or one of its subdirectories:

get_folders_list(storage_client=storage_client, bucket_or_name='bucket_name')
get_folders_list(storage_client=storage_client, bucket_or_name='bucket_name', prefix='sub_folder_name')

Here is a faster way (found in a GitHub thread posted by @evanj: https://github.com/GoogleCloudPlatform/google-cloud-python/issues/920):

def list_gcs_directories(bucket, prefix):
    iterator = bucket.list_blobs(prefix=prefix, delimiter='/')
    prefixes = set()
    for page in iterator.pages:
        print(page, page.prefixes)
        prefixes.update(page.prefixes)
    return prefixes

You want to call this function as follows:

from google.cloud import storage

client = storage.Client()
bucket_name = 'my_bucket_name'
bucket_obj = client.bucket(bucket_name)
list_folders = list_gcs_directories(bucket_obj, prefix='my/prefix/path/within/bucket/')

# Getting rid of the prefix; the returned prefixes end with '/', so the
# folder name itself is the second-to-last path component
list_folders = [indiv_folder.split('/')[-2] for indiv_folder in list_folders]

You can get all the unique prefixes N levels deep inside a bucket with the Python Cloud Storage library and a one-liner, for example where N=2:

set(["/".join(blob.name.split('/',maxsplit=2)[0:2]) for blob in client.list_blobs(BUCKET_NAME)])

If you want to restrict your results to a particular "folder", add a prefix like so:

set(["/".join(blob.name.split('/',maxsplit=2)[0:2]) for blob in client.list_blobs(BUCKET_NAME, prefix=PREFIX)])

Because your prefix will be one or more levels deep, you will need to adjust N. For example, to get unique prefixes 2 levels deep inside a prefix that is already 1 level deep, N should be 3.因为您的前缀将是一层或多层深,所以您需要调整 N。例如,要在已经 1 层深的前缀中获得 2 层深的唯一前缀,N 应该是 3。
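A hedged helper that parameterizes the depth, following the same one-liner idea (names are illustrative):

from google.cloud import storage


def unique_prefixes(client, bucket_name, depth, prefix=None):
    """Return the unique object-name prefixes `depth` levels deep."""
    return {"/".join(b.name.split("/", maxsplit=depth)[:depth])
            for b in client.list_blobs(bucket_name, prefix=prefix)}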


I'm also surprised that nobody in this thread mentioned the gcsfs library, which lets you do:

import gcsfs

gcs = gcsfs.GCSFileSystem()
gcs.ls(BUCKET_NAME)
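gcsfs also accepts sub-paths and, being an fsspec filesystem, can report whether an entry is directory-like; for example (paths are illustrative):

print(gcs.ls('my-bucket/xyz'))        # entries under the "folder"
print(gcs.isdir('my-bucket/xyz/x1'))  # True for directory-like prefixes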
