
How to run PySpark on AWS EMR with AWS Lambda

How can I run my PySpark code on AWS EMR from AWS Lambda? Do I have to use AWS Lambda to create an auto-terminating EMR cluster to run my S3-stored code once?

You need a transient cluster for this use case: one that auto-terminates once your job completes or the timeout is reached, whichever occurs first.

See the linked guide for how to initialise one.

There are several ways to create an EMR cluster:

  1. Using boto3 / AWS CLI / Java SDK
  2. Using CloudFormation
  3. Using AWS Data Pipeline

Do I have to use AWS Lambda to create an auto-terminating EMR cluster to run my S3-stored code once?

No. It isn't mandatory to use Lambda to create an auto-terminating cluster.

You just need to pass the --auto-terminate flag when creating the cluster with the AWS CLI (or the equivalent setting in boto3 / the Java SDK). In that case you must submit the job steps together with the cluster configuration, since there is no long-lived cluster to attach a job to later.
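As a hedged sketch, creating such a transient cluster with boto3 might look like this. The bucket path, instance types, and EMR release label are placeholder assumptions; `KeepJobFlowAliveWhenNoSteps: False` is boto3's equivalent of the CLI's --auto-terminate flag.

```python
def build_transient_cluster_config(script_s3_path: str) -> dict:
    """Build run_job_flow parameters for a cluster that shuts down after its steps."""
    return {
        "Name": "transient-pyspark-cluster",           # illustrative name
        "ReleaseLabel": "emr-6.15.0",                  # assumed release label
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {"Name": "Primary", "InstanceRole": "MASTER",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"Name": "Core", "InstanceRole": "CORE",
                 "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            # Equivalent of the CLI's --auto-terminate: terminate the cluster
            # once there are no more steps left to run.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [
            {
                "Name": "run-pyspark-script",
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": ["spark-submit", "--deploy-mode", "cluster",
                             script_s3_path],
                },
            }
        ],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


def launch(script_s3_path: str) -> str:
    """Create the cluster and return its cluster id (requires AWS credentials)."""
    import boto3  # imported lazily so the sketch loads without the SDK installed
    emr = boto3.client("emr")
    response = emr.run_job_flow(**build_transient_cluster_config(script_s3_path))
    return response["JobFlowId"]
```

Because the step is submitted as part of the cluster config, the cluster boots, runs the script, and terminates on its own.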

Note:

It's not possible to create an auto-terminating cluster using CloudFormation alone. By design, CloudFormation assumes that the resources being created will be permanent to some extent.

If you really had to do it this way, you could make an AWS API call to delete the CloudFormation stack once your EMR tasks finish.
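A minimal sketch of that cleanup call with boto3 (the stack name is an illustrative assumption; the optional `cf_client` argument only exists so the function can be exercised without AWS credentials):

```python
def delete_stack(stack_name: str, cf_client=None, wait: bool = False) -> None:
    """Delete the CloudFormation stack that owns the EMR cluster."""
    if cf_client is None:
        import boto3  # imported lazily so the sketch loads without the SDK
        cf_client = boto3.client("cloudformation")
    cf_client.delete_stack(StackName=stack_name)
    if wait:
        # Block until CloudFormation confirms the stack is gone.
        cf_client.get_waiter("stack_delete_complete").wait(StackName=stack_name)
```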

How can I run my PySpark code on AWS EMR from AWS Lambda?

You can design your Lambda function to submit the Spark job (see the linked example).

In my use case I have one parameterised Lambda function that invokes CloudFormation to create the cluster, submits the job, and terminates the cluster.
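As a hedged sketch of the job-submission part, a parameterised Lambda handler could read the S3 path of the PySpark script from the invocation event and add a spark-submit step to a cluster. The event keys (`cluster_id`, `script_s3_path`, `spark_args`) are illustrative assumptions, and the optional `emr_client` argument is only there to make the handler testable without AWS credentials.

```python
def build_spark_step(script_s3_path, spark_args=()):
    """Build an EMR step that runs spark-submit on an S3-hosted PySpark script."""
    return {
        "Name": "run-" + script_s3_path.rsplit("/", 1)[-1],
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster",
                     *spark_args, script_s3_path],
        },
    }


def lambda_handler(event, context, emr_client=None):
    """Submit the PySpark job described by the event to an existing cluster."""
    if emr_client is None:
        import boto3  # imported lazily so the sketch loads without the SDK
        emr_client = boto3.client("emr")
    step = build_spark_step(event["script_s3_path"],
                            event.get("spark_args", []))
    response = emr_client.add_job_flow_steps(JobFlowId=event["cluster_id"],
                                             Steps=[step])
    return {"StepIds": response["StepIds"]}
```

The same handler could instead call `run_job_flow` with an auto-terminating config to cover the create-run-terminate flow end to end.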
