
Download multiple files from a Google Cloud Storage bucket

Does anyone know how to download every file whose name starts with 'Test' from a GCS bucket? I have tried the glob.glob method with a wildcard, glob.glob('*.csv'), and that has not worked. Any help would be much appreciated.

from google.cloud import storage

def download_blob(bucket_name, source_blob_name, destination_file_name):
  """Download a single object from a bucket to a local file."""
  storage_client = storage.Client()
  bucket = storage_client.bucket(bucket_name)
  blob = bucket.blob(source_blob_name)
  blob.download_to_filename(destination_file_name)

# Pass the values as arguments instead of overwriting the parameters inside the function
download_blob(
  'gcs-bucket',
  'folder1/folder2/Test_data_01052022104530.csv',
  r'C:\Users\path\path1\path2\path3\path4\Test.csv',
)

GCS doesn't have the concept of directories, so "folder1/folder2/Test_data_01052022104530.csv" is not a file called "Test_data_01052022104530.csv" inside two folders, but an object whose full name is "folder1/folder2/Test_data_01052022104530.csv". So you don't want to look for files starting with "Test", but for objects with the string "/Test" anywhere in their name (or whose name starts with "Test", for objects at the root).
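To make the distinction concrete, here is a minimal sketch that treats object names as plain strings. The helper name `base_name_starts_with` and every sample name except the one from the question are made up for illustration. Note that it checks only the part after the last "/", which is slightly stricter than searching for "/Test" anywhere in the name: a "directory" such as "folder1/Testing/" does not count as a match here.

```python
def base_name_starts_with(object_name: str, prefix: str = "Test") -> bool:
    """Return True if the part after the last "/" starts with `prefix`."""
    # GCS object names are flat strings; "/" is only a naming convention.
    base_name = object_name.rsplit("/", 1)[-1]
    return base_name.startswith(prefix)


names = [
    "folder1/folder2/Test_data_01052022104530.csv",  # object name from the question
    "Test_root.csv",                                 # hypothetical object at the root
    "folder1/Testing/other.csv",                     # hypothetical: "/Test" appears mid-name
    "folder1/data.csv",                              # hypothetical non-match
]
matches = [name for name in names if base_name_starts_with(name)]
```

With these sample names, `matches` keeps the first two entries. The third one contains "/Test" but its base name is "other.csv", so whether it should count depends on which rule you actually want.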

GCS provides no facility to do this directly; you will have to list the objects and filter them yourself. For example:

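  import os
  from google.cloud import storage

  BUCKET = "gcs-bucket"  # bucket name from the question
  client = storage.Client()
  for blob in client.list_blobs(BUCKET):
    if "/Test" in blob.name or blob.name.startswith("Test"):
      blob.download_to_filename(os.path.basename(blob.name))  # save under the base name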
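When several matching objects share a base name in different "folders", writing them all to one place would overwrite earlier downloads. The following is a rough sketch of one way to avoid that by recreating the object name's path segments under a local root. The names `local_path_for`, `download_matching` and `dest_root` are hypothetical, and `client` is assumed to be a `storage.Client()` created by the caller.

```python
from pathlib import Path


def local_path_for(object_name: str, dest_root: str) -> Path:
    """Map an object name to a path under dest_root, keeping its "folders"."""
    # Drop empty, "." and ".." segments so the result always stays below dest_root.
    parts = [p for p in object_name.split("/") if p not in ("", ".", "..")]
    return Path(dest_root).joinpath(*parts)


def download_matching(client, bucket_name: str, dest_root: str) -> list:
    """List every object and download the ones that match the "Test" rule."""
    saved = []
    for blob in client.list_blobs(bucket_name):
        if blob.name.endswith("/"):
            continue  # skip zero-byte "folder" placeholder objects
        if "/Test" in blob.name or blob.name.startswith("Test"):
            target = local_path_for(blob.name, dest_root)
            target.parent.mkdir(parents=True, exist_ok=True)
            blob.download_to_filename(str(target))
            saved.append(target)
    return saved
```

A call might look like `download_matching(storage.Client(), "gcs-bucket", "downloads")`, where the destination directory name is only an example.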

For large buckets, this might be slow, especially if only a few objects match. Some ways to speed it up:

  • If your data is structured so that you have a small number of directories, and only a small fraction of the files will match: it may be faster to "walk the tree" using the prefix and delimiter options of list_blobs, and then, for each "directory", do a list_blobs with prefix directory+"/Test" to list matching objects only.

  • Do the listing or the downloads in parallel.
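Both ideas can be sketched together as follows. This is an illustration under stated assumptions rather than a tuned implementation: the function names are hypothetical, `client` is assumed to be a `storage.Client()`, and sharing one client across threads is assumed to be acceptable for your workload. Prefixes returned by a delimiter listing already end in "/", so the per-directory prefix is built as directory + "Test". The second listing also passes the delimiter, so only objects directly inside each "directory" are returned and nothing is listed twice.

```python
from concurrent.futures import ThreadPoolExecutor


def list_directories(client, bucket_name, prefix=""):
    """Yield `prefix` and every "directory" prefix below it."""
    yield prefix
    iterator = client.list_blobs(bucket_name, prefix=prefix, delimiter="/")
    for _ in iterator:
        pass  # iterator.prefixes is only filled in once the pages are consumed
    for sub_prefix in sorted(iterator.prefixes):
        yield from list_directories(client, bucket_name, sub_prefix)


def find_matches(client, bucket_name, stem="Test"):
    """Return blobs whose base name starts with `stem`, one listing per directory."""
    matches = []
    for directory in list_directories(client, bucket_name):
        # directory is "" for the root, otherwise it ends with "/"
        listing = client.list_blobs(bucket_name, prefix=directory + stem, delimiter="/")
        matches.extend(listing)
    return matches


def download_all(blobs, destination_for, max_workers=8):
    """Download blobs concurrently; destination_for maps a blob name to a local path."""
    def _download(blob):
        target = destination_for(blob.name)
        blob.download_to_filename(target)
        return target

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(_download, blobs))


# Usage with a real client (requires google-cloud-storage and credentials):
#   from google.cloud import storage
#   client = storage.Client()
#   blobs = find_matches(client, "gcs-bucket")
#   download_all(blobs, lambda name: name.rsplit("/", 1)[-1])
```

Walking the tree costs one listing request per "directory" plus one per prefix query, so it only pays off when there are few directories relative to the number of objects. The usage example saves each file under its base name, so objects in different "folders" with the same base name would still overwrite one another.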

