How to merge CSV files from an S3 bucket using AWS Glue and save them back to S3

Question

The objective is to transform the data (CSV files) in one S3 bucket and store the result in another S3 bucket, using Glue.

What I already tried:

I created a CSV classifier. I created a crawler that scans the data arriving in the S3 bucket.

Where I am stuck:

I cannot find how to store the output in S3 again without saving it to RDS or another database service. The Glue output asks for a database target, which I don't have and don't want to use.

Is there any way to achieve this without any other DB system, using just S3 and Glue?

More Information

Sample single CSV file I am trying to merge

Classifier with a delimiter of ";"

Crawler configuration

Crawler result (no schema detected)

Answer 1

The Glue crawler reported the schema as UNKNOWN because of the number of rows in the source files. Refer to the Built-In CSV Classifier section of this doc, which covers the classifier used in your case.

According to the doc, to be classified as CSV the table schema must have at least two columns and two rows of data.
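As a rough local pre-check before running the crawler, the rule above can be sketched in plain Python. This only approximates the documented minimum and is not Glue's actual classifier logic. The function name meets_csv_minimum is hypothetical, and the sketch counts every non-empty line (header included) as a row, which may be more lenient than the crawler.

```python
import csv
import io


def meets_csv_minimum(text, delimiter=";"):
    """Return True if the sample has at least two rows and every row has at least two columns."""
    # csv.reader yields an empty list for blank lines, so those are filtered out.
    rows = [row for row in csv.reader(io.StringIO(text), delimiter=delimiter) if row]
    if len(rows) < 2:
        return False
    return all(len(row) >= 2 for row in rows)
```

A file holding only a header line, or only a single column, would fail this check and is a likely candidate for the "no schema detected" result.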

In your case, you can use an AWS Glue job and read the files directly from S3 in either of the following ways:

1. Create a DynamicFrame and pass ";" as the separator in format_options. Below is a sample that you can modify according to your needs.

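from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# InputDir: S3 path of the source files
dyF = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={"paths": [InputDir]},
    format="csv",
    format_options={"withHeader": True, "separator": ";", "quoteChar": '"', "escaper": '"'},
    transformation_ctx="taxidata",
)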

2. Use a Spark DataFrame to read the data from S3, then convert it to a DynamicFrame if you want to leverage Glue's native transformations:

df = spark.read.options(delimiter=';').csv("s3://path-to-files/")
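The snippet above stops at the Spark DataFrame. A minimal sketch of the conversion step follows, assuming the code runs inside a Glue job where the awsglue library is available. The helper name to_dynamic_frame and the frame name "merged_csv" are illustrative and not part of the original answer.

```python
def to_dynamic_frame(df, glue_context, name="merged_csv"):
    """Wrap a Spark DataFrame in a Glue DynamicFrame.

    The import is deferred because awsglue is only installed in the
    Glue job runtime, not in a plain local Python environment.
    """
    from awsglue.dynamicframe import DynamicFrame

    return DynamicFrame.fromDF(df, glue_context, name)
```

In a Glue job, spark is typically obtained as glueContext.spark_session, so the call would look like to_dynamic_frame(df, glueContext).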

If you want to merge files with different schemas, read the data for each schema into its own frame and then merge the frames using a Join operator.
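A minimal sketch of such a join, assuming both inputs are already DynamicFrames that share a key column and that the code runs in the Glue job runtime, where awsglue.transforms is importable. The helper name join_frames is hypothetical.

```python
def join_frames(left, right, left_keys, right_keys):
    """Join two DynamicFrames on the given key columns using Glue's Join transform."""
    # Deferred import: awsglue is only present in the Glue job runtime.
    from awsglue.transforms import Join

    return Join.apply(left, right, left_keys, right_keys)
```

For example, join_frames(orders, customers, ["id"], ["id"]) would match rows whose "id" values are equal; orders, customers, and "id" are placeholder names. Files that share the same schema need no join at all: reading the whole S3 prefix, as in the two read examples above, already returns the rows from every file under it.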

Refer to this, which has example code to join data and write it back to S3.
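To address the original question directly: a Glue job can write a DynamicFrame straight to an S3 prefix, with no database target involved. A minimal sketch, assuming glue_context is the job's GlueContext and output_path is an S3 prefix the job role can write to. The helper name write_csv_to_s3 is hypothetical.

```python
def write_csv_to_s3(glue_context, dynamic_frame, output_path, separator=";"):
    """Write a DynamicFrame to an S3 prefix as CSV files (no catalog table or database)."""
    return glue_context.write_dynamic_frame.from_options(
        frame=dynamic_frame,
        connection_type="s3",
        connection_options={"path": output_path},
        format="csv",
        format_options={"separator": separator, "writeHeader": True},
    )
```

Each Spark partition is written as its own file, so the output can be several part files under the prefix. If a single merged file is required, calling coalesce(1) on the Spark DataFrame before converting it to a DynamicFrame produces one part file, at the cost of writing through a single task.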

Answer 1
0 2020-09-10 12:05:56
