
Reading multiple files from Google Storage using Python client asynchronously

I am trying to read a list of files uploaded to a Google Storage bucket and load them into a buffer so that I can perform some aggregation over them.

So far I am able to read the contents of all the files serially (each blob object comes from an iterator over all the files in the bucket). However, I have uploaded thousands of files to Google Cloud Storage, and even just reading them takes a significant amount of time.

from google.cloud import storage
import json
import time

import multiprocessing
from multiprocessing import Pool, Manager

cpu_count = multiprocessing.cpu_count()
manager = Manager()
finalized_list = manager.list()

# Explicitly use service account credentials by specifying the private key file.
storage_client = storage.Client.from_service_account_json('.serviceAccountCredentials.json')
bucket_name = "bucket-name"

def list_blobs():
    blobs = storage_client.list_blobs(bucket_name)
    return blobs


def read_blob(blob):
    bucket = storage_client.bucket(bucket_name)
    blob_object = bucket.blob(blob)
    with blob_object.open("r") as f:
        converted_string = f.read()
        print(converted_string)
        finalized_list.append(converted_string)

def main():
    start_time = time.time()
    print("Start time: ", start_time)

    pool = Pool(processes=cpu_count)
    blobs = list_blobs()
    pool.map(read_blob, [blob for blob in blobs])
    
    end_time = time.time()
    elapsed_time = end_time - start_time
    print("Time taken: ", elapsed_time, " seconds")

if __name__ == "__main__":
    main()

In the snippet above I want to use multiprocessing in Python to read each blob object in the bucket. However, since the blob objects returned by Google Cloud Storage are not plain iterator/list objects, I get the error message Pickling client objects is not explicitly supported.

Is there any other way to quickly fetch and read thousands of files from Cloud Storage with a Python script?
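For reference, the pickling error can usually be avoided by mapping over plain blob-name strings and creating the Storage client inside each worker process. The sketch below only illustrates that idea; it assumes the same bucket_name and service-account file as the snippet above and is not the original code.

from multiprocessing import Pool

from google.cloud import storage

bucket_name = "bucket-name"

def read_blob_by_name(blob_name):
    # Each worker builds its own client, so no unpicklable client object
    # ever has to cross the process boundary. In practice you would reuse
    # one client per worker instead of one per blob.
    client = storage.Client.from_service_account_json('.serviceAccountCredentials.json')
    bucket = client.bucket(bucket_name)
    return bucket.blob(blob_name).download_as_text()

def main():
    client = storage.Client.from_service_account_json('.serviceAccountCredentials.json')
    # Pass only the names (strings) to the pool, not the Blob objects.
    blob_names = [blob.name for blob in client.list_blobs(bucket_name)]
    with Pool() as pool:
        contents = pool.map(read_blob_by_name, blob_names)
    print(len(contents), "files read")

if __name__ == "__main__":
    main()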

Here is a solution I built a few years ago with concurrent.futures.ProcessPoolExecutor (I had a CPU-heavy task; if you are mostly waiting for the responses, you can also use concurrent.futures.ThreadPoolExecutor):

from google.cloud import storage

# multi CPU
import concurrent.futures

# progress bar
from tqdm import tqdm

bucket_name = 'your_bucket'
path_to_folder = 'your_path_to_the_files'
file_ending = '.pkl'

kwargs_bucket={
    'bucket_or_name': bucket_name,
    #'max_results': 60, # comment if you want to run it on all files
    'prefix': path_to_folder
}

kwargs_process_pool={
    #'max_workers': 1 #comment if you want full speed
}

# a list to store the output
results = []

# connect to the bucket
client = storage.Client()
bucket = client.get_bucket(bucket_name)

# a list to collect the futures returned by the executor
futures = []
# progress bar
with tqdm(total=sum(1 for blob in client.list_blobs(**kwargs_bucket) if blob.name.endswith(file_ending)), position=0, leave=True) as pbar:
    #ProcessPoolExecutor
    with concurrent.futures.ProcessPoolExecutor(**kwargs_process_pool) as executor:
        # getting all the files from the bucket
        for blob in client.list_blobs(**kwargs_bucket):
            # skip the folder
            if not blob.name.endswith(file_ending):
                continue
            # submit your processing function (your_function) for each matching blob name
            futures.append(executor.submit(your_function, blob.name))

        # updating the progress bar and checking the return
        for future in concurrent.futures.as_completed(futures):
            pbar.update(1)
            if future.result() != '':
                results.append(future.result())

I found out the hard way that you should only pass plain variables, not objects, to your_function in the executor. That is why I pass blob.name.
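your_function is left undefined in the snippet above; a minimal sketch of what it could look like, assuming the worker simply returns the blob's text and that an empty string marks files to skip (the names below are illustrative, not from the original answer):

from google.cloud import storage

bucket_name = 'your_bucket'

def your_function(blob_name):
    # Runs inside a worker process: create a fresh client here instead of
    # passing one in, because client objects cannot be pickled.
    client = storage.Client()
    bucket = client.get_bucket(bucket_name)
    content = bucket.blob(blob_name).download_as_text()
    # Return '' for anything you want the caller to ignore; the loop above
    # only appends non-empty results.
    return content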

Hope that helps.
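Since reading from Cloud Storage is mostly waiting on I/O, the ThreadPoolExecutor variant mentioned above is often sufficient and sidesteps pickling entirely. A short, self-contained sketch of that approach (bucket name and worker count are placeholders, not from the original answer):

import concurrent.futures

from google.cloud import storage

client = storage.Client()

def read_one(blob):
    # With threads there is no pickling, so Blob objects can be passed directly.
    return blob.download_as_text()

# Skip folder placeholder entries, then download everything concurrently.
blobs = [b for b in client.list_blobs('your_bucket') if not b.name.endswith('/')]
with concurrent.futures.ThreadPoolExecutor(max_workers=16) as executor:
    results = list(executor.map(read_one, blobs))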
