
EMR spark job with python code through AWS Lambda

I would like to trigger an EMR Spark job written in Python from AWS Lambda after an S3 event fires. I would appreciate it if anyone could share the configuration/commands needed to invoke the EMR Spark job from an AWS Lambda function.

Since this question is very generic, I will try to give some example code for doing this. You will have to change certain parameters based upon your actual values.

The way I generally do this is to place the main handler function in one file, say lambda_handler.py, and all the configuration and steps of the EMR cluster in a file named emr_configuration_and_steps.py.

Please check the code snippet below for lambda_handler.py.

import boto3
import emr_configuration_and_steps
import logging
import traceback

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)
formatter = logging.Formatter('%(levelname)s:%(name)s:%(message)s')


def create_emr(name):
    # Spawn an EMR cluster using the settings defined in emr_configuration_and_steps.py
    try:
        emr = boto3.client('emr')
        cluster_id = emr.run_job_flow(
            Name=name,
            VisibleToAllUsers=emr_configuration_and_steps.visible_to_all_users,
            LogUri=emr_configuration_and_steps.log_uri,
            ReleaseLabel=emr_configuration_and_steps.release_label,
            Applications=emr_configuration_and_steps.applications,
            Tags=emr_configuration_and_steps.tags,
            Instances=emr_configuration_and_steps.instances,
            Steps=emr_configuration_and_steps.steps,
            Configurations=emr_configuration_and_steps.configurations,
            ScaleDownBehavior=emr_configuration_and_steps.scale_down_behavior,
            ServiceRole=emr_configuration_and_steps.service_role,
            JobFlowRole=emr_configuration_and_steps.job_flow_role
        )
        logger.info("EMR is created successfully")
        return cluster_id['JobFlowId']
    except Exception:
        # Log the full traceback and re-raise so the invocation is reported as failed
        traceback.print_exc()
        raise


def lambda_handler(event, context):
    # Entry point invoked by the S3 event notification
    logger.info("starting the lambda function for spawning EMR")
    try:
        emr_cluster_id = create_emr('Name of Your EMR')
        logger.info("emr_cluster_id is = " + emr_cluster_id)
    except Exception as e:
        logger.error("Exception at some step in the process  " + str(e))
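
Since the Lambda is triggered by an S3 event, you will usually want to pass the bucket and key of the object that fired the event down to the Spark step. Below is a minimal sketch of two helpers you could add to lambda_handler.py; it assumes the standard S3 event notification layout and that the spark-submit step is the second entry in emr_configuration_and_steps.steps. The names extract_s3_object and build_spark_step are hypothetical and not part of the original files.

import copy
from urllib.parse import unquote_plus


def extract_s3_object(event):
    # Standard S3 event notification layout: Records[0].s3.bucket.name / object.key
    record = event['Records'][0]['s3']
    bucket = record['bucket']['name']
    key = unquote_plus(record['object']['key'])  # object keys arrive URL-encoded in the event
    return bucket, key


def build_spark_step(bucket, key):
    # Hypothetical helper: copy the spark-submit step from the configuration file and
    # append the triggering object's S3 path as an application argument
    step = copy.deepcopy(emr_configuration_and_steps.steps[1])
    step['HadoopJarStep']['Args'].append("s3://{}/{}".format(bucket, key))
    return step

You would then call these from lambda_handler and pass the modified step list into run_job_flow (for example by giving create_emr an extra steps parameter) instead of always using emr_configuration_and_steps.steps as-is.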

Now the second file (emr_configuration_and_steps.py), which holds all the configuration, would look like this.

visible_to_all_users = True
log_uri = 's3://your-s3-log-path-here/'
release_label = 'emr-5.29.0'
applications = [{'Name': 'Spark'}, {'Name': 'Hadoop'}]
tags = [
    {'Key': 'Project', 'Value': 'Your-Project Name'},
    {'Key': 'Service', 'Value': 'Your-Service Name'},
    {'Key': 'Environment', 'Value': 'Development'}
]

instances = {
    'Ec2KeyName': 'Your-key-name',
    'Ec2SubnetId': 'your-subnet-name',
    'InstanceFleets': [
        {
            "InstanceFleetType": "MASTER",
            "TargetOnDemandCapacity": 1,
            "TargetSpotCapacity": 0,
            "InstanceTypeConfigs": [
                {
                    "WeightedCapacity": 1,
                    "BidPriceAsPercentageOfOnDemandPrice": 100,
                    "InstanceType": "m3.xlarge"
                }
            ],
            "Name": "Master Node"
        },
        {
            "InstanceFleetType": "CORE",
            "TargetSpotCapacity": 8,
            "InstanceTypeConfigs": [
                {
                    "WeightedCapacity": 8,
                    "BidPriceAsPercentageOfOnDemandPrice": 50,
                    "InstanceType": "m3.xlarge"
                }
            ],
            "Name": "Core Node"
        },
    ],
    'KeepJobFlowAliveWhenNoSteps': False
}
steps = [
    {
        'Name': 'Setup Hadoop Debugging',
        'ActionOnFailure': 'TERMINATE_CLUSTER',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': ['state-pusher-script']
        }
    },
    {
        "Name": "Active Marker for digital panel",
        "ActionOnFailure": 'TERMINATE_CLUSTER',
        'HadoopJarStep': {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--deploy-mode",
                "cluster",
                "--driver-memory", "4g",
                "--executor-memory", "4g",
                "--executor-cores", "2",
                "--class", "your-main-class-full-path-name",
                "s3://your-jar-path-SNAPSHOT-jar-with-dependencies.jar"
            ]
        }
    }
]

configurations = [
    {
        "Classification": "spark-log4j",
        "Properties": {
            "log4j.logger.root": "INFO",
            "log4j.logger.org": "INFO",
            "log4j.logger.com": "INFO"
        }
    }
]
scale_down_behavior = 'TERMINATE_AT_TASK_COMPLETION'
service_role = 'EMR_DefaultRole'
job_flow_role = 'EMR_EC2_DefaultRole'

Please adjust the paths and names according to your use case. To deploy this, you need to install boto3, package/zip these 2 files into a zip file, and upload it to your Lambda function. With that you should be able to spawn the EMR cluster.
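
If you prefer to keep a long-running cluster (KeepJobFlowAliveWhenNoSteps set to True) instead of spawning a new one for every S3 event, the Lambda can simply submit the Spark step to the existing cluster with boto3's add_job_flow_steps. A rough sketch, assuming you already know the cluster id (the j-XXXXXXXXXXXXX value below is only a placeholder):

import boto3


def submit_spark_step(cluster_id, step):
    # Add a single step to an already-running EMR cluster and return the new step id
    emr = boto3.client('emr')
    response = emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
    return response['StepIds'][0]


# Example usage inside lambda_handler, reusing the spark-submit step from the configuration file:
# step_id = submit_spark_step('j-XXXXXXXXXXXXX', emr_configuration_and_steps.steps[1])

This avoids paying the cluster start-up time on every invocation, at the cost of keeping the instances running between jobs.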
