I want to add two steps together in AWS EMR cluster. Step 1 is a pyspark based code, Step 2 is Scala-spark based code.
How do I achieve this?
Running a step doesn't depend on which language the previous step used, as long as the logic between steps is correct (inputs/outputs line up with your pipeline).
For example, your first step (I assume the cluster is already running) is in Python; it could be something like reading data from MySQL/S3, performing ETL, and saving the result back to S3 (notice /home/hadoop/spark/myscript.py here):
aws emr add-steps --cluster-id j-xxxxxxx \
--steps Name=Spark,Jar=s3://eu-west-1.elasticmapreduce/libs/script-runner/script-runner.jar,Args=[/home/hadoop/spark/bin/spark-submit,--deploy-mode,client,/home/hadoop/spark/myscript.py],ActionOnFailure=CONTINUE
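If you prefer to submit steps programmatically, the same PySpark step can be expressed as a boto3 `Steps` entry. This is a minimal sketch, assuming the script path and jar from the CLI example above; treat them as placeholders for your own cluster.

```python
def pyspark_step(script_path):
    # Mirrors the CLI example: script-runner.jar invokes spark-submit
    # with the given PySpark script.
    return {
        "Name": "Spark",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "s3://eu-west-1.elasticmapreduce/libs/script-runner/script-runner.jar",
            "Args": [
                "/home/hadoop/spark/bin/spark-submit",
                "--deploy-mode", "client",
                script_path,
            ],
        },
    }

step = pyspark_step("/home/hadoop/spark/myscript.py")
# With real credentials and a running cluster, this dict could be passed to
# boto3.client("emr").add_job_flow_steps(JobFlowId="j-xxxxxxx", Steps=[step])
```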
Your next step can be in anything, including Scala. For example (notice /usr/lib/spark/examples/jars/spark-examples.jar here):
aws emr add-steps --cluster-id j-2AXXXXXXGAPLF \
--steps Type=Spark,Name="Spark Program",ActionOnFailure=CONTINUE,Args=[--class,org.apache.spark.examples.SparkPi,/usr/lib/spark/examples/jars/spark-examples.jar,10]
Now consider the command below, which submits two steps in a single call (notice the blank space before the second Type). Note: the step names are CustomJAR and CustomJAR2.
aws emr add-steps --cluster-id j-XXXXXXXX \
--steps Type=CUSTOM_JAR,Name=CustomJAR,ActionOnFailure=CONTINUE,Jar=s3://mybucket/mytest.jar,Args=arg1,arg2,arg3 Type=CUSTOM_JAR,Name=CustomJAR2,ActionOnFailure=CONTINUE,Jar=s3://mybucket/mytest.jar,MainClass=mymainclass,Args=arg1,arg2,arg3
You just need to put your own Python and Scala steps in place.
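To make that concrete, here is a hedged sketch of how a PySpark step followed by a Scala step could be submitted together via boto3, in the same spirit as the two-step CLI command above. The cluster id, bucket paths, step names, and the com.example.MyApp class are all hypothetical placeholders; steps run in the order they appear in the list.

```python
def spark_steps():
    # Step 1: PySpark script, launched via spark-submit.
    pyspark = {
        "Name": "PySparkStep",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "client",
                     "s3://mybucket/myscript.py"],
        },
    }
    # Step 2: Scala jar, launched via spark-submit with --class.
    scala = {
        "Name": "ScalaStep",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--class", "com.example.MyApp",
                     "s3://mybucket/myapp.jar", "10"],
        },
    }
    return [pyspark, scala]  # EMR runs these in list order

steps = spark_steps()
# With real credentials and a running cluster:
# import boto3
# boto3.client("emr").add_job_flow_steps(JobFlowId="j-XXXXXXXX", Steps=steps)
```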