
File conversion in AWS

Posted 2022-12-12 10:27:49 | Tags: amazon-s3, aws-lambda, aws-glue, amazon-data-pipeline

I am trying to find the most efficient way to process files in AWS.

Read a JSON, XML, or CSV file from an S3 bucket
Map it to another JSON, XML, or CSV format
Save the result to an S3 bucket

Right now we are using Java with AWS Lambda functions, but we end up writing a lot of code. AWS Glue looks good, but my experience with MS BizTalk has been even better.

Is there any service that can help me with this?

1 Solution

AWS offers several options for reading a file in one format from an S3 bucket and writing it back to S3 in another format. Some of them are listed below.

A) AWS SDK for pandas (formerly AWS Data Wrangler), an open source Python library from AWS ProServe. You can run it from a Lambda function or from any other server where the SDK can be installed. It provides several out-of-the-box connectors for reading and writing data from various sources and sinks. This option is suitable when data volumes are low.
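
As a rough illustration of option A, here is a minimal sketch of a Lambda handler that reads a CSV from S3, applies a mapping, and writes JSON Lines back to S3. It assumes the function is triggered by an S3 ObjectCreated event and that the awswrangler package (or its Lambda layer) is available. The OUTPUT_BUCKET variable, the helper names, and the lowercase-column mapping are hypothetical placeholders, not something from the question.

```python
import os
import posixpath
from urllib.parse import unquote_plus

# Hypothetical output bucket name, read from a Lambda environment variable.
OUTPUT_BUCKET = os.environ.get("OUTPUT_BUCKET", "my-output-bucket")


def parse_s3_event(event):
    """Return (bucket, key) for the first record of an S3 ObjectCreated event."""
    s3 = event["Records"][0]["s3"]
    # Object keys arrive URL-encoded in S3 event notifications.
    return s3["bucket"]["name"], unquote_plus(s3["object"]["key"])


def target_path(key, output_bucket):
    """Same key in the output bucket, with the extension swapped to .json."""
    stem, _ext = posixpath.splitext(key)
    return f"s3://{output_bucket}/{stem}.json"


def convert_csv_to_json(source_path, dest_path):
    """Read a CSV from S3, apply a mapping, and write JSON Lines back to S3."""
    # Imported lazily so the pure helpers above work without the library installed.
    import awswrangler as wr

    df = wr.s3.read_csv(path=source_path)
    # Placeholder mapping (an assumption): lowercase the column names.
    df = df.rename(columns=str.lower)
    wr.s3.to_json(df=df, path=dest_path, orient="records", lines=True)


def lambda_handler(event, context):
    bucket, key = parse_s3_event(event)
    source = f"s3://{bucket}/{key}"
    dest = target_path(key, OUTPUT_BUCKET)
    convert_csv_to_json(source, dest)
    return {"source": source, "destination": dest}
```

The real mapping logic would replace the single rename line. Since the whole file is loaded into a pandas DataFrame in memory, this fits the low-volume case described above.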

B) AWS Glue, a serverless data integration service that runs jobs written in Spark or Python. Glue Studio also provides a drag-and-drop option for generating data pipelines from many out-of-the-box transformations. You can control the processing window by choosing the number of Data Processing Units (DPUs) a job uses. Glue Workflows are available for orchestration.
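
To make the DPU point in option B concrete, here is a small sketch that starts an existing Glue Spark job through boto3 and sets its capacity per run. It assumes a job has already been created in Glue. The function name and the `--source_path` / `--dest_path` job arguments are hypothetical and would have to match whatever your job script reads. With the G.1X worker type, each worker maps to 1 DPU.

```python
def start_conversion_job(glue_client, job_name, source_path, dest_path,
                         workers=2, worker_type="G.1X"):
    """Start a run of an existing Glue job and return its run ID.

    glue_client is expected to be boto3.client("glue"); it is passed in
    so the function can be exercised with a stub.
    """
    response = glue_client.start_job_run(
        JobName=job_name,
        # Hypothetical job arguments; the job script decides what it accepts.
        Arguments={"--source_path": source_path, "--dest_path": dest_path},
        # More workers means more DPUs and usually a shorter processing window.
        WorkerType=worker_type,
        NumberOfWorkers=workers,
    )
    return response["JobRunId"]


# Example call (requires boto3 and AWS credentials):
#   import boto3
#   run_id = start_conversion_job(
#       boto3.client("glue"), "csv-to-json",
#       "s3://in-bucket/orders.csv", "s3://out-bucket/orders/", workers=4)
```

Raising the worker count is a trade-off between run time and cost, since Glue bills by DPU-hours.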

C) Amazon EMR, a petabyte-scale AWS service for high-volume distributed data processing, machine learning, and interactive analytics using open source frameworks such as Apache Spark.

Which option to choose depends on the use case you are trying to solve and on your requirements. Factors such as data volume, processing window, low-code/no-code options, and cost will help decide which one to use.