Triggering an AWS EMR Spark job from an S3 event

I am considering using AWS EMR Spark to run a Spark application against very large Parquet files stored on S3. The overall flow here is that a Java process would upload these large files to S3, and I'd like to automatically trigger the running of a Spark job (injected with the S3 key name(s) of the uploaded files) on those files.

Ideally, there would be some kind of S3-based EMR trigger available to wire up; that is, I configure EMR/Spark to "listen" to an S3 bucket and to kick off a Spark job when an upsert is made to that bucket.

If no such trigger exists, I could probably kludge something together, such as kicking off a Lambda from the S3 event and having the Lambda somehow trigger the EMR Spark job.
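For reference on the first half of that kludge, S3 can invoke a Lambda function directly through a bucket notification configuration. A minimal sketch with the AWS SDK for Java v1 follows; the bucket name and function ARN are hypothetical placeholders, and S3 must separately be granted permission to invoke the function:

import java.util.EnumSet;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.BucketNotificationConfiguration;
import com.amazonaws.services.s3.model.LambdaConfiguration;
import com.amazonaws.services.s3.model.S3Event;

public class WireBucketToLambda {
    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        // Hypothetical function ARN; grant s3.amazonaws.com permission to
        // invoke the function separately (e.g. via aws lambda add-permission).
        String functionArn = "arn:aws:lambda:us-east-1:123456789012:function:emr-spark-trigger";
        BucketNotificationConfiguration config = new BucketNotificationConfiguration()
                .addConfiguration("spark-on-upload",
                        new LambdaConfiguration(functionArn, EnumSet.of(S3Event.ObjectCreated)));
        // Hypothetical bucket name: fire the Lambda on every object created.
        s3.setBucketNotificationConfiguration("my-upload-bucket", config);
    }
}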

However, my understanding (please correct me if I'm wrong) is that the only way to kick off a Spark job is to:

  1. Package the job up as an executable JAR file; and
  2. Submit it to the cluster (EMR or otherwise) via the spark-submit shell script.

So if I have to do the Lambda-based kludge, I'm not exactly sure what the best way to trigger the EMR/Spark job is, seeing that Lambdas don't natively carry spark-submit in their runtimes. And even if I configured my own Lambda runtime (which I believe is now possible to do), this solution already feels really wonky and fault-intolerant.

Has anybody ever triggered an EMR/Spark job from an S3 trigger, or any other AWS trigger, before?

An EMR Spark job can be executed as a step, as shown in Adding a Spark Step. A step is not limited to EMR cluster creation time after bootstrap; it can also be added to a cluster that is already running:

aws emr add-steps --cluster-id j-2AXXXXXXGAPLF --steps Type=Spark,Name="Spark Program",ActionOnFailure=CONTINUE,Args=[--class,org.apache.spark.examples.SparkPi,/usr/lib/spark/examples/jars/spark-examples.jar,10]

As it is an AWS CLI command, you can invoke it from Lambda; in the Lambda you can also upload the jar file to HDFS or S3, then point to it using s3:// or hdfs://.

The document also has a Java example.

import java.util.ArrayList;
import java.util.List;

import com.amazonaws.auth.AWSCredentials;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
import com.amazonaws.services.elasticmapreduce.model.AddJobFlowStepsRequest;
import com.amazonaws.services.elasticmapreduce.model.AddJobFlowStepsResult;
import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
import com.amazonaws.services.elasticmapreduce.model.StepConfig;

AWSCredentials credentials = new BasicAWSCredentials(accessKey, secretKey);
AmazonElasticMapReduceClient emr = new AmazonElasticMapReduceClient(credentials);

// Target an existing, running cluster by its job flow (cluster) id.
AddJobFlowStepsRequest req = new AddJobFlowStepsRequest();
req.withJobFlowId("j-1K48XXXXXXHCB");

List<StepConfig> stepConfigs = new ArrayList<StepConfig>();

// command-runner.jar executes the given command (here spark-submit) on the master node.
HadoopJarStepConfig sparkStepConf = new HadoopJarStepConfig()
            .withJar("command-runner.jar")
            .withArgs("spark-submit", "--executor-memory", "1g",
                    "--class", "org.apache.spark.examples.SparkPi",
                    "/usr/lib/spark/examples/jars/spark-examples.jar", "10");

StepConfig sparkStep = new StepConfig()
            .withName("Spark Step")
            .withActionOnFailure("CONTINUE")
            .withHadoopJarStep(sparkStepConf);

stepConfigs.add(sparkStep);
req.withSteps(stepConfigs);
AddJobFlowStepsResult result = emr.addJobFlowSteps(req);
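To connect this back to the original S3-triggered flow, here is a minimal sketch of the Lambda handler side, assuming a Java Lambda built on the AWS SDK v1 together with the aws-lambda-java-events library; the cluster id, jar location, and job class below are hypothetical placeholders. The handler pulls the uploaded object's bucket and key out of the S3 event and injects the resulting s3:// path into the step as the Spark job's argument:

import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder;
import com.amazonaws.services.elasticmapreduce.model.AddJobFlowStepsRequest;
import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
import com.amazonaws.services.elasticmapreduce.model.StepConfig;
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.S3Event;

public class S3ToEmrSparkHandler implements RequestHandler<S3Event, String> {

    // Hypothetical placeholders: a long-running cluster and the job's jar on S3.
    private static final String CLUSTER_ID = "j-1K48XXXXXXHCB";
    private static final String JOB_JAR = "s3://my-bucket/jars/my-spark-job.jar";

    @Override
    public String handleRequest(S3Event event, Context context) {
        AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.defaultClient();
        event.getRecords().forEach(record -> {
            String bucket = record.getS3().getBucket().getName();
            String key = record.getS3().getObject().getKey(); // note: may be URL-encoded
            // command-runner.jar runs spark-submit on the master node; the
            // uploaded object's S3 path is passed as the job's argument.
            HadoopJarStepConfig stepConf = new HadoopJarStepConfig()
                    .withJar("command-runner.jar")
                    .withArgs("spark-submit", "--deploy-mode", "cluster",
                            "--class", "com.example.MySparkJob", // hypothetical job class
                            JOB_JAR, "s3://" + bucket + "/" + key);
            emr.addJobFlowSteps(new AddJobFlowStepsRequest()
                    .withJobFlowId(CLUSTER_ID)
                    .withSteps(new StepConfig()
                            .withName("Process " + key)
                            .withActionOnFailure("CONTINUE")
                            .withHadoopJarStep(stepConf)));
        });
        return "step submitted";
    }
}

This assumes a cluster that stays up between uploads; if you would rather pay per job, the same Lambda could instead launch a transient cluster per event with the RunJobFlow API.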
