
AWS EMR CLI add-step with multiple files

I have an EMR environment that runs fine when I submit single Python (PySpark) files from a local shell script (myProgram.py was already copied up to S3...):

aws emr add-steps \
  --cluster-id $CID \
  --steps Type=Spark,\
Name="chessAnalytics",\
ActionOnFailure=CONTINUE,\
Args=[s3://$SRC_BUCKET/myProgram.py,\
--gameset1,\'$GAMESET1_URI\',\
--gameset2,\'$GAMESET2_URI\',\
--output_uri,s3://$OUT_BUCKET/V4_t]

I want to factor some utils out of myProgram.py into a peer file, utils.py. All the docs on this keep coming back to the --py-files option for spark-submit and using a zip file. I am not using spark-submit; I am using aws emr add-steps. Rather blindly, I tried adding --py-files to the add-steps launch:

aws emr add-steps \
  --cluster-id $CID \
  --steps Type=Spark,\
Name="chessAnalytics",\
ActionOnFailure=CONTINUE,\
Args=[--py-files,s3://$SRC_BUCKET/myZipFile.zip,\
--gameset1,\'$GAMESET1_URI\',\
--gameset2,\'$GAMESET2_URI\',\
--output_uri,s3://$OUT_BUCKET/V4_t]

But this fails with the error Error: Unrecognized option: --gameset1

What is an appropriate way to bundle multiple Python files for PySpark execution on AWS EMR using the local CLI (aws emr add-steps)?

To pass multiple files in a step, you need to use file:// to supply the steps as a JSON file.

The AWS CLI shorthand syntax uses a comma as the delimiter between list items. So when we try to pass in parameters like:

"files","s3://betaestimationtest/mapper.py,s3://betaestimationtest/reducer.py"

the shorthand-syntax parser treats mapper.py and reducer.py as two separate parameters.

The workaround is to use the JSON format. See the example below.

aws emr create-cluster --steps file://./mysteps.json --ami-version 3.1.0 --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge --auto-terminate --log-uri s3://betaestimationtest/logs

mysteps.json looks like:

[
    {
    "Name": "Intra country development",
    "Type": "STREAMING",
    "ActionOnFailure": "CONTINUE",
    "Args": [
        "-files",
        "s3://betaestimationtest/mapper.py,s3://betaestimationtest/reducer.py",
        "-mapper",
        "mapper.py",
        "-reducer",
        "reducer.py",
        "-input",
        "s3://betaestimationtest/output_0_inte",
        "-output",
        "s3://betaestimationtest/output_1_intra"
    ]}
]
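The same file:// approach answers the original PySpark question with a Spark-type step. A sketch (hedged: the bucket and URI values below are placeholders standing in for the question's shell variables — the CLI does not expand $VARS inside a JSON file, so substitute real values or generate the file from your shell script). Note that spark-submit options such as --py-files must appear in Args before the application script:

```json
[
  {
    "Name": "chessAnalytics",
    "Type": "Spark",
    "ActionOnFailure": "CONTINUE",
    "Args": [
      "--py-files", "s3://SRC_BUCKET/myZipFile.zip",
      "s3://SRC_BUCKET/myProgram.py",
      "--gameset1", "GAMESET1_URI",
      "--gameset2", "GAMESET2_URI",
      "--output_uri", "s3://OUT_BUCKET/V4_t"
    ]
  }
]
```

Submit it with aws emr add-steps --cluster-id $CID --steps file://./spark-step.json. This also suggests why the attempt in the question failed: its Args listed --py-files and the zip but never the application script itself, so spark-submit had no program to hand --gameset1 to.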

You can also find examples here: https://github.com/aws/aws-cli/blob/develop/awscli/examples/emr/create-cluster-examples.rst
