Python: How to extract a zip file in Google Cloud Storage without running out of memory?

Question

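I need to extract the files in a zip file stored in Google Cloud Storage. I'm using a Python function to do this, but I keep running into memory issues, even on a Dask cluster where each Dask worker has a 20GB memory limit.

How could I optimize my code so that it doesn't consume as much memory? Perhaps read the zip file in chunks, stream them to a temporary file, and then send that file to Google Cloud Storage?

Would appreciate any guidance here.

Here is my code: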

import io
from zipfile import ZipFile, is_zipfile

from google.cloud import storage

# `task` is the decorator of the workflow library in use; its import is not shown in the question.
@task
def unzip_files(
    bucket_name,
    zip_data
):
    file_date = zip_data['file_date']
    gcs_folder_path = zip_data['gcs_folder_path']
    gcs_blob_name = zip_data['gcs_blob_name']

    storage_client = storage.Client()
    bucket = storage_client.get_bucket(bucket_name)

    destination_blob_pathname = f'{gcs_folder_path}/{gcs_blob_name}'
    blob = bucket.blob(destination_blob_pathname)
    # The whole zip archive is downloaded into memory here.
    zipbytes = io.BytesIO(blob.download_as_string())

    if is_zipfile(zipbytes):
        with ZipFile(zipbytes, 'r') as zipObj:
            extracted_file_paths = []
            for content_file_name in zipObj.namelist():
                # Each member is fully decompressed into memory before upload.
                content_file = zipObj.read(content_file_name)
                extracted_file_path = f'{gcs_folder_path}/hgdata_{file_date}_{content_file_name}'
                blob = bucket.blob(extracted_file_path)
                blob.upload_from_string(content_file)
                extracted_file_paths.append(f'gs://{bucket_name}/{extracted_file_path}')
        return extracted_file_paths

    else:
        return []

Answer 1

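I don't quite follow your code, but in general Dask copes well with complex file operations like this by using the fsspec and gcsfs libraries. For example (and you don't need Dask for this):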

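import fsspec

# "zip://*" selects every member of the archive; "::" chains it onto the GCS file.
# Reading from GCS requires the gcsfs package to be installed.
with fsspec.open_files("zip://*::gcs://gcs_folder_path/gcs_blob_name") as open_files:
    for of in open_files:
        # Placeholder: build the destination path from `of`.
        with fsspec.open("gcs://{something from of}", "wb") as f:
            # Copy in 4 MiB chunks so a whole member is never held in memory.
            while True:
                data = of.read(2**22)
                if not data:
                    break
                f.write(data)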
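
The chunked copy in the loop above does not depend on fsspec. As a minimal sketch using only the standard library, the same pattern works with `zipfile.ZipFile.open`, which returns a file object that decompresses as it is read. The helper names `extract_member_streaming` and `build_demo_zip` are made up for this illustration, and in-memory buffers stand in for the GCS objects. With google-cloud-storage, the file objects returned by `blob.open("rb")` and `blob.open("wb")` could probably take their place, but that is an assumption and is not tested here.

```python
import io
import zipfile

CHUNK_SIZE = 2**22  # 4 MiB per read, the same chunk size as in the loop above


def extract_member_streaming(zip_file, member_name, destination, chunk_size=CHUNK_SIZE):
    """Copy one zip member to `destination` chunk by chunk.

    `zip_file` is any seekable binary file object holding the archive and
    `destination` is any writable binary file object (both hypothetical
    stand-ins for cloud objects). Returns the number of bytes written.
    """
    written = 0
    with zipfile.ZipFile(zip_file) as archive:
        # archive.open() decompresses lazily, so only one chunk is in memory.
        with archive.open(member_name) as source:
            while True:
                chunk = source.read(chunk_size)
                if not chunk:
                    break
                destination.write(chunk)
                written += len(chunk)
    return written


def build_demo_zip(member_name, payload):
    """Build a small in-memory archive, used here only for demonstration."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(member_name, payload)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    demo_zip = build_demo_zip("a.csv", b"abc" * 50_000)
    out = io.BytesIO()
    copied = extract_member_streaming(demo_zip, "a.csv", out, chunk_size=1024)
    print(f"copied {copied} bytes")
```

Peak memory for the copy is bounded by `chunk_size` rather than by the size of the member, which is the point of the pattern. The archive itself still has to be seekable, because the zip directory sits at the end of the file.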

Alternatively, in the fsspec snippet above you could write

open_files = fsspec.open_files(...)

and parallelize the loop with Dask.
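
To illustrate the per-member parallelism, here is a hedged sketch that swaps Dask for `concurrent.futures.ThreadPoolExecutor` from the standard library and uses local paths instead of GCS, so that it runs without extra packages. The function names `list_members`, `extract_one`, and `extract_all_parallel` are hypothetical. Each task opens its own handle on the archive and streams exactly one member, so the unit of work is independent of the others.

```python
import os
import zipfile
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 2**22  # 4 MiB per read


def list_members(zip_path):
    """Return the names of all regular files in the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        return [info.filename for info in archive.infolist() if not info.is_dir()]


def extract_one(zip_path, member_name, output_dir):
    """Stream a single member to `output_dir` and return the written path.

    The archive is reopened inside the task so that workers share no state.
    Members with the same base name would overwrite each other in this sketch.
    """
    target = os.path.join(output_dir, os.path.basename(member_name))
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(member_name) as source, open(target, "wb") as sink:
            while True:
                chunk = source.read(CHUNK_SIZE)
                if not chunk:
                    break
                sink.write(chunk)
    return target


def extract_all_parallel(zip_path, output_dir, max_workers=4):
    """Extract every member, one task per member, preserving member order."""
    members = list_members(zip_path)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda name: extract_one(zip_path, name, output_dir), members))
```

Because `extract_one` takes only plain arguments and returns a path, it is the kind of function that could presumably be wrapped in `dask.delayed` or submitted through a `dask.distributed` `Client.map` instead of the thread pool. Each task then needs memory for roughly one chunk, not for a whole member.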

Answer 1 (accepted, 2020-08-21 13:14:32)