
How can I add 2 (PySpark, Scala) steps together in AWS EMR?

I want to add two steps to an AWS EMR cluster: step 1 is PySpark-based code and step 2 is Scala-Spark-based code.

How do I achieve this?

Running a step doesn't depend on which language the previous step used, as long as your step logic is correct (each step's inputs and outputs line up with your pipeline).

For example, your first step (I assume the cluster is already running) is in Python; it might read data from MySQL/S3, perform ETL, and save the result to S3 (notice /home/hadoop/spark/myscript.py here):

aws emr add-steps --cluster-id j-xxxxxxx \
  --steps Name=Spark,Jar=s3://eu-west-1.elasticmapreduce/libs/script-runner/script-runner.jar,\
Args=[/home/hadoop/spark/bin/spark-submit,--deploy-mode,client,/home/hadoop/spark/myscript.py],\
ActionOnFailure=CONTINUE
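
As a side note, on EMR release 4.x and later the regional script-runner.jar path is generally superseded by command-runner.jar, which can invoke spark-submit directly from the instance PATH. A minimal sketch of the same step in that form (the script path is the same placeholder as above):

aws emr add-steps --cluster-id j-xxxxxxx \
  --steps Name=Spark,Jar=command-runner.jar,\
Args=[spark-submit,--deploy-mode,client,/home/hadoop/spark/myscript.py],\
ActionOnFailure=CONTINUE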

Your next step can be anything, including Scala. For example (notice /usr/lib/spark/examples/jars/spark-examples.jar here):

aws emr add-steps --cluster-id j-2AXXXXXXGAPLF \
  --steps Type=Spark,Name="Spark Program",ActionOnFailure=CONTINUE,\
Args=[--class,org.apache.spark.examples.SparkPi,/usr/lib/spark/examples/jars/spark-examples.jar,10]
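
Since each step's output is the next step's input, it helps to confirm a step finished before relying on its results. The commands below are standard AWS CLI calls; the step ID shown is a placeholder you would take from the list-steps output:

aws emr list-steps --cluster-id j-xxxxxxx

aws emr describe-step --cluster-id j-xxxxxxx --step-id s-XXXXXXXXXXXX

The step's State field moves through PENDING and RUNNING to COMPLETED or FAILED.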

Now consider the command below, which submits two steps in a single call (notice the blank space separating the two Type= step definitions):

Note: the step names are CustomJAR and CustomJAR2.

aws emr add-steps --cluster-id j-XXXXXXXX \
  --steps Type=CUSTOM_JAR,Name=CustomJAR,ActionOnFailure=CONTINUE,Jar=s3://mybucket/mytest.jar,Args=arg1,arg2,arg3 \
  Type=CUSTOM_JAR,Name=CustomJAR2,ActionOnFailure=CONTINUE,Jar=s3://mybucket/mytest.jar,MainClass=mymainclass,Args=arg1,arg2,arg3

Now you just need to plug your Python and Scala steps into that pattern, as in the sketch below.
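
For instance, a single submission covering both steps could look like this; the bucket, script, jar, and class names (s3://mybucket/myscript.py, s3://mybucket/myapp.jar, com.example.MyApp) are placeholders for your own artifacts:

aws emr add-steps --cluster-id j-xxxxxxx \
  --steps Type=Spark,Name=PySparkStep,ActionOnFailure=CONTINUE,\
Args=[--deploy-mode,cluster,s3://mybucket/myscript.py] \
  Type=Spark,Name=ScalaStep,ActionOnFailure=CONTINUE,\
Args=[--class,com.example.MyApp,--deploy-mode,cluster,s3://mybucket/myapp.jar]

Both steps land in the cluster's step queue and, with the default step concurrency of 1, run one after the other in submission order.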
