I have an EMR environment that runs fine when I submit single Python (PySpark) files from a local shell script (myProgram.py was already copied up to S3):
aws emr add-steps \
--cluster-id $CID \
--steps Type=Spark,\
Name="chessAnalytics",\
ActionOnFailure=CONTINUE,\
Args=[s3://$SRC_BUCKET/myProgram.py,\
--gameset1,\'$GAMESET1_URI\',\
--gameset2,\'$GAMESET2_URI\',\
--output_uri,s3://$OUT_BUCKET/V4_t]
I wish to factor some utils out of myProgram.py into a peer file utils.py. All the docs on this keep coming back to the --py-files option for spark-submit and using a zip file. I am not using spark-submit; I am using aws emr add-steps. Rather blindly, I tried adding --py-files to the add-steps launch:
aws emr add-steps \
--cluster-id $CID \
--steps Type=Spark,\
Name="chessAnalytics",\
ActionOnFailure=CONTINUE,\
Args=[--py-files,s3://$SRC_BUCKET/myZipFile.zip,\
--gameset1,\'$GAMESET1_URI\',\
--gameset2,\'$GAMESET2_URI\',\
--output_uri,s3://$OUT_BUCKET/V4_t]
But this fails with the error: Error: Unrecognized option: --gameset1
What is an appropriate way to bundle multiple Python files for PySpark execution on AWS EMR using the local CLI (aws emr add-steps)?
For passing multiple files in a step, you need to use file:// to pass the steps as a JSON file.
The AWS CLI shorthand syntax uses the comma as a delimiter to separate a list of arguments. So when you try to pass in parameters like:
"-files","s3://betaestimationtest/mapper.py,s3://betaestimationtest/reducer.py"
the shorthand syntax parser treats mapper.py and reducer.py as two separate parameters.
The workaround is to use the JSON format. Please see the example below.
aws emr create-cluster --steps file://./mysteps.json \
  --ami-version 3.1.0 \
  --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge \
                    InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge \
  --auto-terminate \
  --log-uri s3://betaestimationtest/logs
mysteps.json looks like:
[
  {
    "Name": "Intra country development",
    "Type": "STREAMING",
    "ActionOnFailure": "CONTINUE",
    "Args": [
      "-files",
      "s3://betaestimationtest/mapper.py,s3://betaestimationtest/reducer.py",
      "-mapper",
      "mapper.py",
      "-reducer",
      "reducer.py",
      "-input",
      "s3://betaestimationtest/output_0_inte",
      "-output",
      "s3://betaestimationtest/output_1_intra"
    ]
  }
]
You can also find examples here: https://github.com/aws/aws-cli/blob/develop/awscli/examples/emr/create-cluster-examples.rst
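Applied to the Spark step from the question, the same file:// approach might look like the sketch below. The bucket and variable names are placeholders standing in for the question's shell variables; since shell variables are not expanded inside a JSON file passed via file://, one option is to generate the file with an unquoted heredoc so the shell substitutes them first. Note that with spark-submit-style arguments, the main script (myProgram.py) must still appear after --py-files and before the application arguments.

```shell
# Placeholder values; in practice these come from your environment,
# as in the question's original script.
SRC_BUCKET="my-src-bucket"
OUT_BUCKET="my-out-bucket"
GAMESET1_URI="s3://my-src-bucket/gameset1"
GAMESET2_URI="s3://my-src-bucket/gameset2"

# Unquoted heredoc: the shell expands the variables into the JSON.
cat > spark_step.json <<EOF
[
  {
    "Name": "chessAnalytics",
    "Type": "Spark",
    "ActionOnFailure": "CONTINUE",
    "Args": [
      "--py-files",
      "s3://$SRC_BUCKET/myZipFile.zip",
      "s3://$SRC_BUCKET/myProgram.py",
      "--gameset1", "$GAMESET1_URI",
      "--gameset2", "$GAMESET2_URI",
      "--output_uri", "s3://$OUT_BUCKET/V4_t"
    ]
  }
]
EOF

# The step would then be submitted with:
# aws emr add-steps --cluster-id "$CID" --steps file://./spark_step.json
python3 -m json.tool spark_step.json > /dev/null && echo "spark_step.json is valid JSON"
```

Because the step is defined in a JSON file rather than shorthand syntax, the comma inside any multi-file value no longer confuses the CLI parser.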