Triggering an AWS EMR Spark job from an S3 event

I am considering using AWS EMR Spark to run a Spark application against very large Parquet files stored on S3. The overall flow here is that a Java process would upload these large files to S3, and I'd like to automatically trigger the running of a Spark job (injected with the S3 key name(s) of the uploaded files) on those files.

Ideally, there would be some kind of S3-based EMR trigger available to wire up; that is, I configure EMR/Spark to "listen" to an S3 bucket and to kick off a Spark job when an upsert is made to that bucket.

If no such trigger exists, I could probably kludge something together, such as kicking off a Lambda from the S3 event and having the Lambda somehow trigger the EMR Spark job.
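For reference on the first half of that kludge, S3 can invoke a Lambda function directly through a bucket notification configuration. A minimal sketch with the AWS SDK for Java v1 follows; the bucket name and function ARN are hypothetical placeholders, and S3 must separately be granted permission to invoke the function:

import java.util.EnumSet;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.BucketNotificationConfiguration;
import com.amazonaws.services.s3.model.LambdaConfiguration;
import com.amazonaws.services.s3.model.S3Event;

public class WireBucketToLambda {
    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        // Hypothetical function ARN; grant s3.amazonaws.com permission to
        // invoke the function separately (e.g. via aws lambda add-permission).
        String functionArn = "arn:aws:lambda:us-east-1:123456789012:function:emr-spark-trigger";
        BucketNotificationConfiguration config = new BucketNotificationConfiguration()
                .addConfiguration("spark-on-upload",
                        new LambdaConfiguration(functionArn, EnumSet.of(S3Event.ObjectCreated)));
        // Hypothetical bucket name: fire the Lambda on every object created.
        s3.setBucketNotificationConfiguration("my-upload-bucket", config);
    }
}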

However, my understanding (please correct me if I'm wrong) is that the only way to kick off a Spark job is to:

  1. Package the job up as an executable JAR file; and
  2. Submit it to the cluster (EMR or otherwise) via the spark-submit shell script.

So if I have to do the Lambda-based kludge, I'm not exactly sure what the best way to trigger the EMR/Spark job is, seeing that Lambdas don't natively carry spark-submit in their runtimes. And even if I configured my own Lambda runtime (which I believe is now possible to do), this solution already feels really wonky and fault-intolerant.

Has anybody ever triggered an EMR/Spark job from an S3 trigger, or any other AWS trigger, before?

An EMR Spark job can be executed as a step, as shown in Adding a Spark Step. A step is not limited to EMR cluster creation time after bootstrap; it can also be added to a cluster that is already running:

aws emr add-steps --cluster-id j-2AXXXXXXGAPLF --steps Type=Spark,Name="Spark Program",ActionOnFailure=CONTINUE,Args=[--class,org.apache.spark.examples.SparkPi,/usr/lib/spark/examples/jars/spark-examples.jar,10]

As it is an AWS CLI command, you can invoke it from Lambda; in the Lambda you can also upload the jar file to HDFS or S3, then point to it using s3:// or hdfs://.

The document also has a Java example.

import java.util.ArrayList;
import java.util.List;

import com.amazonaws.auth.AWSCredentials;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
import com.amazonaws.services.elasticmapreduce.model.AddJobFlowStepsRequest;
import com.amazonaws.services.elasticmapreduce.model.AddJobFlowStepsResult;
import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
import com.amazonaws.services.elasticmapreduce.model.StepConfig;

AWSCredentials credentials = new BasicAWSCredentials(accessKey, secretKey);
AmazonElasticMapReduceClient emr = new AmazonElasticMapReduceClient(credentials);

// Target an existing, running cluster by its job flow (cluster) id.
AddJobFlowStepsRequest req = new AddJobFlowStepsRequest();
req.withJobFlowId("j-1K48XXXXXXHCB");

List<StepConfig> stepConfigs = new ArrayList<StepConfig>();

// command-runner.jar executes the given command (here spark-submit) on the master node.
HadoopJarStepConfig sparkStepConf = new HadoopJarStepConfig()
            .withJar("command-runner.jar")
            .withArgs("spark-submit", "--executor-memory", "1g",
                    "--class", "org.apache.spark.examples.SparkPi",
                    "/usr/lib/spark/examples/jars/spark-examples.jar", "10");

StepConfig sparkStep = new StepConfig()
            .withName("Spark Step")
            .withActionOnFailure("CONTINUE")
            .withHadoopJarStep(sparkStepConf);

stepConfigs.add(sparkStep);
req.withSteps(stepConfigs);
AddJobFlowStepsResult result = emr.addJobFlowSteps(req);
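To connect this back to the original S3-triggered flow, here is a minimal sketch of the Lambda handler side, assuming a Java Lambda built on the AWS SDK v1 together with the aws-lambda-java-events library; the cluster id, jar location, and job class below are hypothetical placeholders. The handler pulls the uploaded object's bucket and key out of the S3 event and injects the resulting s3:// path into the step as the Spark job's argument:

import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder;
import com.amazonaws.services.elasticmapreduce.model.AddJobFlowStepsRequest;
import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
import com.amazonaws.services.elasticmapreduce.model.StepConfig;
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.S3Event;

public class S3ToEmrSparkHandler implements RequestHandler<S3Event, String> {

    // Hypothetical placeholders: a long-running cluster and the job's jar on S3.
    private static final String CLUSTER_ID = "j-1K48XXXXXXHCB";
    private static final String JOB_JAR = "s3://my-bucket/jars/my-spark-job.jar";

    @Override
    public String handleRequest(S3Event event, Context context) {
        AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.defaultClient();
        event.getRecords().forEach(record -> {
            String bucket = record.getS3().getBucket().getName();
            String key = record.getS3().getObject().getKey(); // note: may be URL-encoded
            // command-runner.jar runs spark-submit on the master node; the
            // uploaded object's S3 path is passed as the job's argument.
            HadoopJarStepConfig stepConf = new HadoopJarStepConfig()
                    .withJar("command-runner.jar")
                    .withArgs("spark-submit", "--deploy-mode", "cluster",
                            "--class", "com.example.MySparkJob", // hypothetical job class
                            JOB_JAR, "s3://" + bucket + "/" + key);
            emr.addJobFlowSteps(new AddJobFlowStepsRequest()
                    .withJobFlowId(CLUSTER_ID)
                    .withSteps(new StepConfig()
                            .withName("Process " + key)
                            .withActionOnFailure("CONTINUE")
                            .withHadoopJarStep(stepConf)));
        });
        return "step submitted";
    }
}

This assumes a cluster that stays up between uploads; if you would rather pay per job, the same Lambda could instead launch a transient cluster per event with the RunJobFlow API.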
