Download multiple files from a Google Cloud Storage bucket
Does anyone know how to download all the files whose names start with 'Test' from a GCS bucket? I have tried glob.glob with a wildcard, glob.glob('*.csv'), and that has not worked. Any help would be much appreciated.
from google.cloud import storage

def download_blob(bucket_name, source_blob_name, destination_file_name):
    # Download a single object from the bucket to a local file
    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(source_blob_name)
    blob.download_to_filename(destination_file_name)

download_blob(
    'gcs-bucket',
    'folder1/folder2/Test_data_01052022104530.csv',
    r'C:\Users\path\path1\path2\path3\path4\Test.csv',
)
GCS doesn't have the concept of directories, so "folder1/folder2/Test_data_01052022104530.csv" is not a file called "Test_data_01052022104530.csv", but an object called "folder1/folder2/Test_data_01052022104530.csv". So you don't want to look for files starting with "Test", but for objects with the string "/Test" anywhere in their name (or whose name starts with "Test", for objects at the root).
GCS provides no direct facility for this kind of search, so you will have to list the objects and filter them yourself. For example:
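from google.cloud import storage

client = storage.Client()
for blob in client.list_blobs(BUCKET):  # BUCKET is a placeholder for your bucket name
    if "/Test" in blob.name or blob.name.startswith("Test"):
        # somewhere is a placeholder; use a different local path for each blob
        blob.download_to_filename(somewhere)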
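The loop above leaves the local destination open. Here is a minimal sketch of one way to fill it in, assuming you want each match saved under a distinct local name. The helper names (is_test_object, local_path_for, download_matching) and the choice to replace "/" with "_" are illustrative assumptions, not part of the google-cloud-storage library. Note also that this variant matches only on the last path segment, which is slightly stricter than the "/Test" substring test above (it skips objects that merely sit inside a "directory" whose name starts with "Test").

```python
import os


def is_test_object(name, stem="Test"):
    """Return True if the last path segment of the object name starts with stem."""
    return name.rsplit("/", 1)[-1].startswith(stem)


def local_path_for(name, dest_dir):
    """Map an object name to a flat local path, replacing "/" with "_" to avoid collisions."""
    return os.path.join(dest_dir, name.replace("/", "_"))


def download_matching(client, bucket_name, dest_dir, stem="Test"):
    """List every object in the bucket and download the ones that match.

    client is expected to be a google.cloud.storage.Client (or anything
    with a compatible list_blobs method).
    """
    os.makedirs(dest_dir, exist_ok=True)
    downloaded = []
    for blob in client.list_blobs(bucket_name):
        if is_test_object(blob.name, stem):
            target = local_path_for(blob.name, dest_dir)
            blob.download_to_filename(target)
            downloaded.append(target)
    return downloaded
```

With the real library this would be called as download_matching(storage.Client(), 'gcs-bucket', 'downloads'), where 'downloads' is a hypothetical local directory.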
For large buckets, this might be slow, especially if few objects match. Some ways to speed it up:
If your data is structured so that you have a small number of directories and only a small fraction of the files match, it may be faster to "walk the tree" using the prefix and delimiter options of list_blobs, and then for each "directory" call list_blobs with prefix directory+"/Test" to list only the matching objects.
Do the listing or the downloads in parallel.
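Both suggestions can be sketched together as follows. This is only an illustration under some assumptions: the function names (iter_directories, list_matching, download_parallel), the thread count, and the flattened local file names are made up for the example, and whether the tree walk actually beats a plain full listing depends on the bucket layout, so it is worth measuring. The sketch relies on the prefixes attribute of the iterator returned by list_blobs, which is only populated after the pages have been consumed. The prefixes returned there end with "/", so the per-directory listing uses directory + "Test" rather than directory + "/Test".

```python
import os
from concurrent.futures import ThreadPoolExecutor


def iter_directories(client, bucket_name, prefix=""):
    """Yield prefix and every "directory" prefix below it, depth first."""
    yield prefix
    listing = client.list_blobs(bucket_name, prefix=prefix, delimiter="/")
    for _ in listing:
        pass  # prefixes is only filled in once the pages have been consumed
    for sub_prefix in sorted(listing.prefixes):
        yield from iter_directories(client, bucket_name, sub_prefix)


def list_matching(client, bucket_name, stem="Test"):
    """Return blobs whose last path segment starts with stem."""
    matches = []
    for directory in iter_directories(client, bucket_name):
        # directory is "" for the root, otherwise it already ends with "/"
        listing = client.list_blobs(
            bucket_name, prefix=directory + stem, delimiter="/"
        )
        matches.extend(listing)
    return matches


def download_parallel(blobs, dest_dir, max_workers=8):
    """Download the given blobs concurrently; return local paths in input order."""
    os.makedirs(dest_dir, exist_ok=True)

    def fetch(blob):
        # Flatten the object name so every blob gets its own local file
        target = os.path.join(dest_dir, blob.name.replace("/", "_"))
        blob.download_to_filename(target)
        return target

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, blobs))
```

A call would look like download_parallel(list_matching(storage.Client(), 'gcs-bucket'), 'downloads'), with 'downloads' again a hypothetical local directory. Threads are used here because the downloads are I/O bound.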