AWS Glue: ETL job creates many empty output files
I'm very new to this, so not sure if this script could be simplified/if I'm doing something wrong that's resulting in this happening. I've written an ETL script for AWS Glue that writes to a directory within an S3 bucket.
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
# catalog: database and table names
db_name = "events"
tbl_base_event_info = "base_event_info"
tbl_event_details = "event_details"
# output directories
output_dir = "s3://whatever/output"
# create dynamic frames from source tables
base_event_source = glueContext.create_dynamic_frame.from_catalog(database = db_name, table_name = tbl_base_event_info)
event_details_source = glueContext.create_dynamic_frame.from_catalog(database = db_name, table_name = tbl_event_details)
# join frames
base_event_source_df = base_event_source.toDF()
event_details_source_df = event_details_source.toDF()
enriched_event_df = base_event_source_df.join(event_details_source_df, "event_id")
enriched_event = DynamicFrame.fromDF(enriched_event_df, glueContext, "enriched_event")
# write frame to json files
datasink = glueContext.write_dynamic_frame.from_options(frame = enriched_event, connection_type = "s3", connection_options = {"path": output_dir}, format = "json")
job.commit()
The base_event_info table has 4 columns: event_id, event_name, platform, client_info.
The event_details table has 2 columns: event_id, event_details.
The joined table schema should look like: event_id, event_name, platform, client_info, event_details.
After I run this job, I expected to get 2 JSON files, since that's how many records are in the resulting joined table. (There are two records in the tables with the same event_id.) However, what I get is about 200 files, named run-1540321737719-part-r-00000, run-1540321737719-part-r-00001, etc.
Is this the expected behavior? Why is this job generating so many empty files? Is there something wrong with my script?
The Spark SQL module has the following default configuration: spark.sql.shuffle.partitions is set to 200. That's why you are getting 200 files in the first place. You can check whether this is the case by running:
enriched_event_df.rdd.getNumPartitions()
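To see why most of those 200 files come out empty: a join's output rows are hash-distributed across all shuffle partitions, and each partition becomes one output file, so 2 rows spread over 200 partitions leave at least 198 partitions empty. A toy illustration in plain Python (not Spark's actual partitioner, just the same idea):

```python
# Hash-partitioning 2 join keys into 200 shuffle partitions leaves at most
# 2 non-empty partitions -- and each partition is written as one output file.
records = ["event-a", "event-b"]  # the two joined rows
num_partitions = 200              # spark.sql.shuffle.partitions default

buckets = [[] for _ in range(num_partitions)]
for r in records:
    buckets[hash(r) % num_partitions].append(r)

non_empty = sum(1 for b in buckets if b)
print(f"{non_empty} non-empty partitions out of {num_partitions}")
```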
If you get a value of 200, you can change it to the number of files you want to generate with the following code (note that repartition returns a new DataFrame rather than modifying the one it is called on, so reassign the result):
enriched_event_df = enriched_event_df.repartition(2)
This will make Spark write only two files for your data.
In my experience, empty output files point to an error in transformations. You can debug these using the error functions.
By the way, why are you doing the join using Spark DataFrames instead of DynamicFrames?
Instead of repartitioning, you can add a column such as a timestamp to the dataframe through a Spark SQL transformation step, and use it as a partition key when writing the dataframe to S3.
For example: select replace(replace(replace(string(date_trunc('HOUR',current_timestamp())),'-',''),':',''),' ','') as datasetdate, * from myDataSource;
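The nested replace calls simply strip the separators from the hour-truncated timestamp so the value is safe to use in an S3 path. As a plain-Python sanity check of what that expression produces (a sketch; the column name datasetdate comes from the query above):

```python
from datetime import datetime

def dataset_date(ts: datetime) -> str:
    """Mimic the Spark SQL expression: truncate to the hour, then strip
    '-', ':', and ' ' so the value can serve as an S3 partition key."""
    truncated = ts.replace(minute=0, second=0, microsecond=0)
    return str(truncated).replace("-", "").replace(":", "").replace(" ", "")

print(dataset_date(datetime(2018, 10, 23, 14, 35, 7)))  # 20181023140000
```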
Use datasetdate as the partition key while writing the dynamic frame; the Glue job should be able to add the partitions automatically.
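A sketch of that write step, assuming the glueContext, enriched_event, and output_dir names from the question's script and that the datasetdate column has already been added; partitionKeys is the S3 connection option Glue uses to group output by column value (this fragment only runs inside a Glue job):

```python
# Write the frame partitioned by datasetdate: Glue creates one S3 prefix
# per distinct value, e.g. s3://whatever/output/datasetdate=20181023140000/
datasink = glueContext.write_dynamic_frame.from_options(
    frame=enriched_event,
    connection_type="s3",
    connection_options={"path": output_dir, "partitionKeys": ["datasetdate"]},
    format="json",
)
```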
Disclaimer: the technical posts on this site follow the CC BY-SA 4.0 license; if you need to reproduce them, please credit this site or the original source.