Reading multiple files from Google Storage using Python client asynchronously
I am trying to read a list of files uploaded to a Google Storage bucket and load them into a file/buffer so that I can perform some aggregation on them.
So far I am able to read the contents of all the files serially (each blob object comes from an iterator over all the files in the bucket). However, I have uploaded thousands of files to Google Cloud Storage, and even just reading them takes a significant amount of time.
from google.cloud import storage
import json
import time
import multiprocessing
from multiprocessing import Pool, Manager

cpu_count = multiprocessing.cpu_count()
manager = Manager()
finalized_list = manager.list()

# Explicitly use service account credentials by specifying the private key file.
storage_client = storage.Client.from_service_account_json('.serviceAccountCredentials.json')
bucket_name = "bucket-name"

def list_blobs():
    blobs = storage_client.list_blobs(bucket_name)
    return blobs

def read_blob(blob):
    bucket = storage_client.bucket(bucket_name)
    blob_object = bucket.blob(blob)
    with blob_object.open("r") as f:
        converted_string = f.read()
        print(converted_string)
        finalized_list.append(converted_string)

def main():
    start_time = time.time()
    print("Start time: ", start_time)
    pool = Pool(processes=cpu_count)
    blobs = list_blobs()
    pool.map(read_blob, [blob for blob in blobs])
    end_time = time.time()
    elapsed_time = end_time - start_time
    print("Time taken: ", elapsed_time, " seconds")

if __name__ == "__main__":
    main()
In the snippet above I want to use multiprocessing in Python to read every blob object in the bucket. However, since the blob objects returned by Google Cloud Storage are not standard iterator/list objects, I get the error Pickling client objects is not explicitly supported.
Is there any other way to quickly fetch and read thousands of files from Cloud Storage with a Python script?
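The pickling error above can be reproduced without touching GCS at all: a process pool serializes every task argument with pickle, so an object that holds an unpicklable resource (the storage client keeps network/thread state) fails, while a plain blob name passes through. A minimal sketch, where FakeClient is a stand-in for the real client, not the actual google-cloud-storage class:

```python
import pickle
import threading

class FakeClient:
    """Stand-in for storage.Client: holds an unpicklable resource (a lock)."""
    def __init__(self):
        self._lock = threading.Lock()

def is_picklable(obj):
    """Return True if obj survives a round trip through pickle.dumps."""
    try:
        pickle.dumps(obj)
        return True
    except TypeError:
        return False

# A client-like object cannot cross a process boundary...
assert not is_picklable(FakeClient())
# ...but a blob *name* (a plain string) can.
assert is_picklable("some/path/to/blob.json")
```

This is why the accepted pattern is to pass blob names to the workers and let each worker build its own client.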
Here is a solution I built a few years ago with concurrent.futures.ProcessPoolExecutor (I had a CPU-heavy task; if you are mostly waiting on I/O, you can use concurrent.futures.ThreadPoolExecutor instead):
from google.cloud import storage
# multi CPU
import concurrent.futures
# progress bar
from tqdm import tqdm

bucket_name = 'your_bucket'
path_to_folder = 'your_path_to_the_files'
file_ending = '.pkl'

kwargs_bucket = {
    'bucket_or_name': bucket_name,
    #'max_results': 60, # comment if you want to run it on all files
    'prefix': path_to_folder
}
kwargs_process_pool = {
    #'max_workers': 1 #comment if you want full speed
}

# a list to store the output
results = []
# connect to the bucket
client = storage.Client()
bucket = client.get_bucket(bucket_name)
# multi CPU OCR
futures = []
# progress bar
with tqdm(total=sum(1 for blob in client.list_blobs(**kwargs_bucket) if blob.name.endswith(file_ending)), position=0, leave=True) as pbar:
    # ProcessPoolExecutor
    with concurrent.futures.ProcessPoolExecutor(**kwargs_process_pool) as executor:
        # getting all the files from the bucket
        for blob in client.list_blobs(**kwargs_bucket):
            # skip the folder
            if not blob.name.endswith(file_ending):
                continue
            # submit each file name (not the blob object) to the ProcessPoolExecutor
            futures.append(executor.submit(your_function, blob.name))
        # updating the progress bar and checking the return
        for future in concurrent.futures.as_completed(futures):
            pbar.update(1)
            if future.result() != '':
                results.append(future.result())
I found out the hard way that you should only pass plain variables, not objects, to the executor's your_function. That is why I pass blob.name.
Hope this helps.
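The snippet above leaves your_function undefined. A minimal sketch of what it might look like, assuming each worker should download a blob by name and return its text; the bucket name, the lazy per-process client, and the should_process helper are illustrative assumptions, not part of the original answer:

```python
def should_process(blob_name, file_ending='.pkl'):
    """Pure helper mirroring the skip-the-folder check in the loop above."""
    return blob_name.endswith(file_ending)

_client = None  # one storage client per worker process, created lazily

def your_function(blob_name, bucket_name='your_bucket'):
    """Download one blob by its name and return its text content."""
    global _client
    # imported here so each worker process builds its own (unpicklable) client
    from google.cloud import storage
    if _client is None:
        _client = storage.Client()
    blob = _client.bucket(bucket_name).blob(blob_name)
    return blob.download_as_text()
```

Because only the string blob_name crosses the process boundary, this avoids the "Pickling client objects is not explicitly supported" error from the question.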