AWS Glue: ETL job creates many empty output files
I'm very new to this, so not sure if this script could be simplified/if I'm doing something wrong that's resulting in this happening. I've written an ETL script for AWS Glue that writes to a directory within an S3 bucket.
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
# catalog: database and table names
db_name = "events"
tbl_base_event_info = "base_event_info"
tbl_event_details = "event_details"
# output directories
output_dir = "s3://whatever/output"
# create dynamic frames from source tables
base_event_source = glueContext.create_dynamic_frame.from_catalog(database = db_name, table_name = tbl_base_event_info)
event_details_source = glueContext.create_dynamic_frame.from_catalog(database = db_name, table_name = tbl_event_details)
# join frames
base_event_source_df = base_event_source.toDF()
event_details_source_df = event_details_source.toDF()
enriched_event_df = base_event_source_df.join(event_details_source_df, "event_id")
enriched_event = DynamicFrame.fromDF(enriched_event_df, glueContext, "enriched_event")
# write frame to json files
datasink = glueContext.write_dynamic_frame.from_options(frame = enriched_event, connection_type = "s3", connection_options = {"path": output_dir}, format = "json")
job.commit()
The base_event_info table has 4 columns: event_id, event_name, platform, client_info.
The event_details table has 2 columns: event_id, event_details.
The joined table schema should look like: event_id, event_name, platform, client_info, event_details.
After I run this job, I expected to get 2 JSON files, since that's how many records are in the resulting joined table. (There are two records in the tables with the same event_id.) However, what I get is about 200 files, named run-1540321737719-part-r-00000, run-1540321737719-part-r-00001, etc.
Is this the expected behavior? Why is this job generating so many empty files? Is there something wrong with my script?
The Spark SQL module has the following default configuration: spark.sql.shuffle.partitions is set to 200. That's why you are getting 200 files in the first place. You can check whether this is the case by running:
enriched_event_df.rdd.getNumPartitions()
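To see why most of those 200 files come out empty: a join's output rows are hash-distributed across all shuffle partitions, and each partition becomes one output file, so 2 rows spread over 200 partitions leave at least 198 partitions empty. A toy illustration in plain Python (not Spark's actual partitioner, just the same idea):

```python
# Hash-partitioning 2 join keys into 200 shuffle partitions leaves at most
# 2 non-empty partitions -- and each partition is written as one output file.
records = ["event-a", "event-b"]  # the two joined rows
num_partitions = 200              # spark.sql.shuffle.partitions default

buckets = [[] for _ in range(num_partitions)]
for r in records:
    buckets[hash(r) % num_partitions].append(r)

non_empty = sum(1 for b in buckets if b)
print(f"{non_empty} non-empty partitions out of {num_partitions}")
```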
If you get a value of 200, you can change it to the number of files you want to generate with the following code (note that repartition returns a new DataFrame rather than modifying the one it is called on, so reassign the result):
enriched_event_df = enriched_event_df.repartition(2)
This will make Spark write only two files for your data.
In my experience, empty output files point to an error in transformations. You can debug these using the error functions.
By the way, why are you doing the join using Spark DataFrames instead of DynamicFrames?
Instead of repartitioning, you can add a column such as a timestamp to the dataframe through a Spark SQL transformation step, and use it as a partition key when writing the dataframe to S3.
For example: select replace(replace(replace(string(date_trunc('HOUR',current_timestamp())),'-',''),':',''),' ','') as datasetdate, * from myDataSource;
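The nested replace calls simply strip the separators from the hour-truncated timestamp so the value is safe to use in an S3 path. As a plain-Python sanity check of what that expression produces (a sketch; the column name datasetdate comes from the query above):

```python
from datetime import datetime

def dataset_date(ts: datetime) -> str:
    """Mimic the Spark SQL expression: truncate to the hour, then strip
    '-', ':', and ' ' so the value can serve as an S3 partition key."""
    truncated = ts.replace(minute=0, second=0, microsecond=0)
    return str(truncated).replace("-", "").replace(":", "").replace(" ", "")

print(dataset_date(datetime(2018, 10, 23, 14, 35, 7)))  # 20181023140000
```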
Use datasetdate as the partition key while writing the dynamic frame; the Glue job should be able to add the partitions automatically.
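A sketch of that write step, assuming the glueContext, enriched_event, and output_dir names from the question's script and that the datasetdate column has already been added; partitionKeys is the S3 connection option Glue uses to group output by column value (this fragment only runs inside a Glue job):

```python
# Write the frame partitioned by datasetdate: Glue creates one S3 prefix
# per distinct value, e.g. s3://whatever/output/datasetdate=20181023140000/
datasink = glueContext.write_dynamic_frame.from_options(
    frame=enriched_event,
    connection_type="s3",
    connection_options={"path": output_dir, "partitionKeys": ["datasetdate"]},
    format="json",
)
```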
Disclaimer: the technical posts on this site follow the CC BY-SA 4.0 license; if you need to reproduce them, please credit this site or the original source.