Python: How to extract a Zip file in Google Cloud Storage without running out of memory?

Question

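I need to extract the files in a zip file stored in Google Cloud Storage. I'm using a Python function to do this, but I keep running into memory issues, even when using a Dask cluster where each Dask worker has a 20GB memory limit.

How could I optimize my code so that it doesn't consume as much memory? Perhaps read the zip file in chunks, stream them to a temporary file, and then send that file to Google Cloud Storage?

I would appreciate any guidance here.

Here is my code: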

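import io
from zipfile import ZipFile, is_zipfile

from google.cloud import storage

# `task` is the decorator from the asker's workflow framework; its import is not shown in the post.
@task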
def unzip_files(
    bucket_name,
    zip_data
):
    file_date = zip_data['file_date']
    gcs_folder_path = zip_data['gcs_folder_path']
    gcs_blob_name = zip_data['gcs_blob_name']

    storage_client = storage.Client()
    bucket = storage_client.get_bucket(bucket_name)

    destination_blob_pathname = f'{gcs_folder_path}/{gcs_blob_name}'
    blob = bucket.blob(destination_blob_pathname)
    zipbytes = io.BytesIO(blob.download_as_string())

    if is_zipfile(zipbytes):
        with ZipFile(zipbytes, 'r') as zipObj:
            extracted_file_paths = []
            for content_file_name in zipObj.namelist():
                content_file = zipObj.read(content_file_name)
                extracted_file_path = f'{gcs_folder_path}/hgdata_{file_date}_{content_file_name}'
                blob = bucket.blob(extracted_file_path)
                blob.upload_from_string(content_file)
                extracted_file_paths.append(f'gs://{bucket_name}/{extracted_file_path}')
        return extracted_file_paths

    else:
        return []

Answer 1

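I don't quite follow your code, but in general Dask plays nicely with complex file operations like this by using the fsspec and gcsfs libraries. For example (and you don't need Dask for this):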

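import fsspec  # the gcs:// protocol also needs gcsfs installed

# "zip://*" selects every member of the archive; the GCS path is a placeholder.
with fsspec.open_files("zip://*::gcs://gcs_folder_path/gcs_blob_name") as open_files:
    for of in open_files:
        # The destination is a placeholder: derive the real path from `of`.
        with fsspec.open("gcs://{something from of}", "wb") as f:
            while True:
                data = of.read(2**22)  # read at most 4 MiB at a time
                if not data:
                    break
                f.write(data)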

You could instead do

open_files = fsspec.open_files(...)

and parallelize the loop with Dask.
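
To make the chunked-copy idea concrete, here is a minimal standard-library sketch of the same pattern. It is not the fsspec code above: it assumes the archive is reachable as a local path (for example after downloading it to a temporary file), and the names `extract_member`, `extract_all` and `CHUNK_SIZE` are made up for illustration. Each member is copied in fixed-size chunks, so roughly one chunk is held in memory at a time instead of the whole archive and the whole member.

```python
import shutil
import zipfile
from pathlib import Path

CHUNK_SIZE = 2**22  # 4 MiB per read, the same chunk size used in the answer


def extract_member(zip_path, member_name, dest_dir, chunk_size=CHUNK_SIZE):
    """Copy one archive member into dest_dir, chunk_size bytes at a time."""
    # Sketch only: member names are trusted here (no guard against "../" paths).
    dest_path = Path(dest_dir) / member_name
    dest_path.parent.mkdir(parents=True, exist_ok=True)
    # Each call opens the archive itself, so calls do not share any state.
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(member_name) as src, open(dest_path, "wb") as dst:
            # copyfileobj reads and writes in pieces of at most chunk_size bytes.
            shutil.copyfileobj(src, dst, chunk_size)
    return str(dest_path)


def extract_all(zip_path, dest_dir):
    """Extract every file member and return the list of written paths."""
    with zipfile.ZipFile(zip_path) as archive:
        members = [info.filename for info in archive.infolist() if not info.is_dir()]
    return [extract_member(zip_path, name, dest_dir) for name in members]
```

Because `extract_member` is independent per member, the list comprehension in `extract_all` is the loop that could be handed to Dask, one task per member. With GCS, the local `open` calls would be replaced by file objects from fsspec as in the answer; that part is left out here because it depends on your bucket layout.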

Solution 1
1 Accepted 2020-08-21 13:14:32