
AWS Glue python job limits the data amount to write in S3 bucket?

I've created a Glue job to read data from the Glue catalog and save it to an S3 bucket in Parquet format. It works correctly, but the number of items is limited to 20. So every time the job is triggered, only 20 items get saved in the bucket, and I would like to save all of them. Maybe I'm missing some additional property in the Python script.

Here is the script (generated by AWS):

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "cargoprobe_data", table_name = "dev_scv_completed_executions", transformation_ctx = "datasource0")

applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [*field list*], transformation_ctx = "applymapping1")

resolvechoice2 = ResolveChoice.apply(frame = applymapping1, choice = "make_struct", transformation_ctx = "resolvechoice2")

dropnullfields3 = DropNullFields.apply(frame = resolvechoice2, transformation_ctx = "dropnullfields3")

datasink4 = glueContext.write_dynamic_frame.from_options(frame = dropnullfields3, connection_type = "s3", connection_options = {"path": "s3://bucketname"}, format = "parquet", transformation_ctx = "datasink4")

job.commit()

This is done automatically in the background; it is called partitioning. You can repartition by calling

partitioned_df = dropnullfields3.repartition(1)

to repartition your DynamicFrame so that the output is written to a single file.
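As a minimal sketch of how the repartitioned frame could be plugged into the existing sink from the script above (the bucket path and transformation contexts are the ones already used there, not new values):

# Coalesce the DynamicFrame into a single partition before writing
partitioned_df = dropnullfields3.repartition(1)

# Write the single-partition frame to S3; one Parquet file is produced per partition
datasink4 = glueContext.write_dynamic_frame.from_options(frame = partitioned_df, connection_type = "s3", connection_options = {"path": "s3://bucketname"}, format = "parquet", transformation_ctx = "datasink4")

job.commit()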
