Download multiple files from a Google Cloud Storage bucket

Question

Does anyone know how to download all the files whose names start with 'Test' from a GCS bucket? I have tried glob.glob with a wildcard, glob.glob('*.csv'), and that has not worked. Any help would be much appreciated.

from google.cloud import storage

def download_blob(bucket_name, source_blob_name, destination_file_name):
  """Download a single object from a GCS bucket to a local file."""
  storage_client = storage.Client()
  bucket = storage_client.bucket(bucket_name)
  blob = bucket.blob(source_blob_name)
  blob.download_to_filename(destination_file_name)

# Pass the values as arguments instead of overwriting the parameters inside the function
download_blob(
  'gcs-bucket',
  'folder1/folder2/Test_data_01052022104530.csv',
  r'C:\Users\path\path1\path2\path3\path4\Test.csv',
)

Answer 1

GCS doesn't have the concept of directories, so "folder1/folder2/Test_data_01052022104530.csv" is not a file called "Test_data_01052022104530.csv" inside two folders, but an object whose full name is "folder1/folder2/Test_data_01052022104530.csv". So you don't want to look for files starting with "Test", but for objects with the string "/Test" anywhere in their name (or whose name starts with "Test", for objects at the root of the bucket).
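
To make the naming point concrete, here is a minimal sketch in plain Python (no GCS calls) that checks only the last "/"-separated component of an object name. The helper name is_test_object and every sample name except the one from the question are made up for illustration. Note that it is slightly stricter than searching for "/Test" anywhere in the name: an object inside a "directory" called Testdir does not match unless its own base name starts with "Test".

```python
def is_test_object(name: str, stem: str = "Test") -> bool:
    """Return True if the last "/"-separated component of name starts with stem."""
    # rpartition returns (head, sep, tail); tail is the whole name when there is no "/"
    base = name.rpartition("/")[2]
    return base.startswith(stem)


# Hypothetical object names; only the second one comes from the question.
names = [
    "Test_root.csv",
    "folder1/folder2/Test_data_01052022104530.csv",
    "folder1/folder2/other.csv",
    "folder1/Testdir/report.csv",
]
matching = [n for n in names if is_test_object(n)]
```

Here matching keeps the first two names only.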

The prefix filter of the listing API only matches the start of an object name, so to find "/Test" anywhere in a name you have to list the objects and filter them yourself. For example:

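  import os
  from google.cloud import storage

  BUCKET = 'gcs-bucket'  # bucket name from the question
  client = storage.Client()
  for blob in client.list_blobs(BUCKET):
    if "/Test" in blob.name or blob.name.startswith("Test"):
      blob.download_to_filename(os.path.basename(blob.name))  # saved under its base name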

For large buckets, this might be slow, especially if only a few objects match. Some ways to speed it up:

If your data is structured so that you have a small number of directories, and only a small fraction of the files match, it may be faster to "walk the tree" using the prefix and delimiter options of list_blobs, and then for each "directory" call list_blobs with prefix directory+"/Test" to list only the matching objects.
Do the listing or the downloads in parallel.
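
Both suggestions can be sketched roughly as follows. This is an illustration under assumptions, not a tuned implementation: iter_test_blobs, download_all, the stem and max_workers parameters, and the flattened local file names are all my own choices. The client argument is expected to behave like google.cloud.storage.Client, whose list_blobs accepts prefix and delimiter and whose returned iterator exposes prefixes once it has been consumed. The walk matches on the last name component only, and discovering the "directories" still pages through the objects directly under each prefix, so whether it beats a flat listing depends on the bucket layout.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def iter_test_blobs(client, bucket_name, prefix="", stem="Test"):
    """Yield blobs whose last name component starts with `stem`.

    `client` is expected to behave like google.cloud.storage.Client.
    """
    # Objects directly under `prefix` whose base name starts with `stem`.
    # With delimiter="/", deeper objects are rolled up into prefixes instead.
    yield from client.list_blobs(bucket_name, prefix=prefix + stem, delimiter="/")

    # Find the sub-"directories" of `prefix`. The prefixes set is only
    # filled in once the listing has been iterated, so drain it first.
    listing = client.list_blobs(bucket_name, prefix=prefix, delimiter="/")
    for _ in listing:
        pass
    for sub_prefix in sorted(listing.prefixes):
        yield from iter_test_blobs(client, bucket_name, sub_prefix, stem)


def download_all(blobs, dest_dir, max_workers=8):
    """Download blobs concurrently into dest_dir and return the local paths."""
    os.makedirs(dest_dir, exist_ok=True)

    def _download(blob):
        # Flatten "a/b/Test.csv" to "a_b_Test.csv" so equal base names in
        # different "directories" do not overwrite each other.
        target = os.path.join(dest_dir, blob.name.replace("/", "_"))
        blob.download_to_filename(target)
        return target

    # Downloads are I/O bound, so a thread pool is usually enough.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(_download, blobs))
```

With the real library this would be called as download_all(iter_test_blobs(storage.Client(), 'gcs-bucket'), 'downloads'), where 'downloads' is a placeholder local directory.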
