AWS Glue job to unzip a file from S3 and write it back to S3

Question

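I am new to AWS Glue, and I want to use it to unzip a huge file that sits in an S3 bucket and write the contents back to S3.

I couldn't find anything when I tried googling this requirement.

My questions are:

How do I add a zip file as a data source in AWS Glue?
How do I write it back to the same S3 location?

I am using AWS Glue Studio. Any help would be greatly appreciated.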

Answer 1

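In case you are still looking for a solution: you can unzip the file with boto3 and Python's zipfile library and write it back from within an AWS Glue job.

One thing to consider is the size of the zip to be processed. I used the following script with a 6 GB (compressed) / 30 GB (uncompressed) file and it worked fine. However, it may fail if the file is too large for the worker to buffer in memory.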

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

import boto3
import io
from zipfile import ZipFile

s3 = boto3.client("s3")

bucket = "wayfair-datasource" # your s3 bucket name
prefix = "files/location/" # the prefix for the objects that you want to unzip
unzip_prefix = "files/unzipped_location/" # the location where you want to store your unzipped files 

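# Get a list of all the resources in the specified prefix.
# "Contents" is absent when nothing matches, so default to an empty list.
# Note: a single list_objects call returns at most 1,000 keys.
objects = s3.list_objects(
    Bucket=bucket,
    Prefix=prefix
).get("Contents", [])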

# List what has already been unzipped so the job doesn't unzip the same files again on every run.
# On the first run this prefix is empty and the response has no "Contents" key.
unzipped_objects = s3.list_objects(
    Bucket=bucket,
    Prefix=unzip_prefix
).get("Contents", [])

# Get a list containing the keys of the objects to unzip
object_keys = [ o["Key"] for o in objects if o["Key"].endswith(".zip") ] 
# Get the keys for the unzipped objects
unzipped_object_keys = [ o["Key"] for o in unzipped_objects ] 

for key in object_keys:
    obj = s3.get_object(
        Bucket=bucket,
        Key=key
    )
    
    objbuffer = io.BytesIO(obj["Body"].read())
    
    # using a context manager so you don't have to worry about manually closing the file
    # (named "archive" to avoid shadowing the built-in zip())
    with ZipFile(objbuffer) as archive:
        filenames = archive.namelist()

        # iterate over every file inside the zip
        for filename in filenames:
            if filename.endswith("/"):
                continue  # skip directory entries, they hold no data
            with archive.open(filename) as file:
                filepath = unzip_prefix + filename
                if filepath not in unzipped_object_keys:
                    s3.upload_fileobj(file, bucket, filepath)

job.commit()
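
The memory concern mentioned above comes from reading the whole archive into a BytesIO. A minimal sketch of one way to reduce that pressure is to spool the archive to a local temporary file and stream each member to S3 from there. This assumes the worker has enough local scratch space for the compressed archive. The function name `unzip_object` is hypothetical; `s3` is assumed to be a boto3 S3 client, and only its `download_fileobj` and `upload_fileobj` methods are used.

```python
import tempfile
import zipfile


def unzip_object(s3, bucket, key, unzip_prefix, existing_keys=frozenset()):
    """Unzip s3://bucket/key and upload each member under unzip_prefix.

    Hypothetical helper: `s3` is assumed to behave like boto3.client("s3").
    Returns the list of keys that were uploaded.
    """
    uploaded = []
    # Spool the archive to local disk instead of holding it all in memory.
    with tempfile.TemporaryFile() as tmp:
        s3.download_fileobj(bucket, key, tmp)
        tmp.seek(0)
        with zipfile.ZipFile(tmp) as archive:
            for info in archive.infolist():
                if info.is_dir():
                    continue  # directory entries carry no data
                target = unzip_prefix + info.filename
                if target in existing_keys:
                    continue  # already unzipped on an earlier run
                with archive.open(info) as member:
                    # The member is decompressed as it is read, not up front.
                    s3.upload_fileobj(member, bucket, target)
                uploaded.append(target)
    return uploaded
```

Inside the loop of the script above, this could be called as `unzip_object(s3, bucket, key, unzip_prefix, set(unzipped_object_keys))`. Passing a set rather than a list keeps the "already unzipped" check cheap when the prefix holds many keys.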

Answer 2

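"I couldn't find anything when I tried googling this requirement."

You can't find any information about this because it is not something Glue does. Glue can natively read gzip (not zip) files. If you have zip files, you have to convert all of them in S3 yourself. Glue won't do it.

To convert the files, you can download them, repackage them, and re-upload them in gzip format or any other format that Glue supports.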
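
The repackaging step can be sketched with Python's standard library alone. The example below is only an illustration, not something the answer prescribes: it recompresses one member of a zip archive as a gzip stream, copying in chunks so the decompressed data is never held in memory all at once. The function name `zip_member_to_gzip` is hypothetical. Since a gzip stream holds a single file, an archive with several members would yield one .gz object per member.

```python
import gzip
import shutil
import zipfile


def zip_member_to_gzip(zip_file, member_name, out_file):
    """Recompress one member of a zip archive as a gzip stream.

    zip_file: path or seekable binary file object containing the archive.
    out_file: writable binary file object that receives the gzip data.
    """
    with zipfile.ZipFile(zip_file) as archive:
        with archive.open(member_name) as src:
            # GzipFile does not close out_file when it exits.
            with gzip.GzipFile(fileobj=out_file, mode="wb") as dst:
                shutil.copyfileobj(src, dst)
```

The download and re-upload around this function would be done with whichever S3 tooling you already use, for example the boto3 calls shown in Answer 1.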

Answer 1: score 2, 2022-12-02 14:42:52

Answer 2: score 0, accepted, 2021-05-21 06:56:16
