
Can't sync S3 with EC2 folder from AWS Lambda

I am trying to automate data processing using AWS. I have set up an AWS Lambda function in Python that:

  1. Gets triggered by an S3 PUT event
  2. SSHes into an EC2 instance using a paramiko layer
  3. Copies the new objects from the bucket into a folder on the instance, unzips the file inside the instance and runs a Python script that cleans the CSV files

The problem is that the AWS CLI call to sync the S3 bucket with the EC2 folder is not working, but when I manually SSH into the EC2 instance and run the command, it works. My AWS CLI is configured with my access keys, and the EC2 instance has an S3 role that allows it full access.

    import boto3
    import time
    import paramiko

    def lambda_handler(event, context):
        # Create a low-level client representing S3
        s3 = boto3.client('s3')
        ec2 = boto3.resource('ec2', region_name='eu-west-1')
        instance_id = 'i-058456c79fjcde676'
        instance = ec2.Instance(instance_id)

        # Start the instance
        instance.start()
        # Allow some time for the instance to start
        time.sleep(30)

        # Print a few details of the instance
        print("Instance id - ", instance.id)
        print("Instance public IP - ", instance.public_ip_address)
        print("Instance private IP - ", instance.private_ip_address)
        print("Public dns name - ", instance.public_dns_name)
        print("----------------------------------------------------")

        print('Downloading pem file')
        s3.download_file('some_bucket', 'some_pem_file.pem', '/tmp/some_pem_file.pem')

        # Allow a few more seconds for the instance to finish booting
        print('waiting for instance to start')
        time.sleep(30)

        print('sshing to instance')
        ssh = paramiko.SSHClient()
        ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        privkey = paramiko.RSAKey.from_private_key_file('/tmp/some_pem_file.pem')
        # username is most likely 'ec2-user', 'root' or 'ubuntu'
        # depending upon your EC2 AMI
        # s3_path = "s3://some_bucket/" + object_name
        ssh.connect(instance.public_dns_name, username='ubuntu', pkey=privkey)

        print('inside machine...running commands')
        stdin, stdout, stderr = ssh.exec_command(
            'aws s3 sync s3://some_bucket/ ~/ec2_folder; '
            'bash ~/ec2_folder/unzip.sh; '
            'python3 ~/ec2_folder/process.py;')
        stdin.flush()
        data = stdout.read().splitlines()
        for line in data:
            print(line)

        print('done, closing ssh session')
        ssh.close()

        # Stop the instance
        instance.stop()

        return 'Triggered'

The use of an SSH tool is somewhat unusual.

Here are a few more 'cloud-friendly' options you might consider.

Systems Manager Run Command

The AWS Systems Manager Run Command allows you to execute a script on an Amazon EC2 instance (and, in fact, on any computer that is running the Systems Manager agent). It can even run the command on many (hundreds!) of instances/computers at the same time, keeping track of the success of each execution.

This means that, instead of connecting to the instance via SSH, the Lambda function could call the Run Command via an API call and Systems Manager would run the code on the instance.
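
As a rough sketch of that approach (the instance ID, bucket name and script paths below are just the placeholders from the question, and the instance is assumed to be running the SSM agent with an instance profile that permits Systems Manager):

    import boto3

    ssm = boto3.client('ssm')
    INSTANCE_ID = 'i-058456c79fjcde676'   # placeholder instance ID

    def lambda_handler(event, context):
        # Ask Systems Manager to run the shell commands on the instance
        response = ssm.send_command(
            InstanceIds=[INSTANCE_ID],
            DocumentName='AWS-RunShellScript',   # built-in SSM document
            Parameters={
                'commands': [
                    'aws s3 sync s3://some_bucket/ /home/ubuntu/ec2_folder',
                    'bash /home/ubuntu/ec2_folder/unzip.sh',
                    'python3 /home/ubuntu/ec2_folder/process.py',
                ]
            },
        )
        command_id = response['Command']['CommandId']

        # Optionally wait for the command to finish and surface failures
        waiter = ssm.get_waiter('command_executed')
        waiter.wait(CommandId=command_id, InstanceId=INSTANCE_ID)
        return command_id

The waiter at the end is optional; the Lambda function could also just record the CommandId and exit, letting Systems Manager track the execution.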

Pull, Don't Push

Rather than 'pushing' the work to the instance, the instance could 'pull the work':

  • Configure the Amazon S3 event to push a message into an Amazon SQS queue
  • Code on the instance could be regularly polling the SQS queue
  • When it finds a message on the queue, it runs a script that downloads the file (the bucket and key are passed in the message) and then runs the processing script (see the sketch after this list)
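
A minimal sketch of what the instance-side poller could look like, assuming the S3 event notification has been configured to deliver straight into an SQS queue (the queue name, region and local paths are placeholders):

    import json
    import boto3

    sqs = boto3.client('sqs', region_name='eu-west-1')   # placeholder region
    s3 = boto3.client('s3', region_name='eu-west-1')
    queue_url = sqs.get_queue_url(QueueName='incoming-files')['QueueUrl']   # placeholder queue name

    while True:
        # Long-poll the queue for up to 20 seconds
        resp = sqs.receive_message(QueueUrl=queue_url,
                                   MaxNumberOfMessages=1,
                                   WaitTimeSeconds=20)
        for msg in resp.get('Messages', []):
            body = json.loads(msg['Body'])
            # The S3 event notification carries the bucket and key in 'Records'
            for record in body.get('Records', []):
                bucket = record['s3']['bucket']['name']
                key = record['s3']['object']['key']
                local_path = '/home/ubuntu/ec2_folder/' + key.split('/')[-1]
                s3.download_file(bucket, key, local_path)
                # ... unzip and run the CSV-cleaning script here ...

            # Remove the message so it is not processed again
            sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg['ReceiptHandle'])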

Trigger via HTTP

The instance could run a web server, listening for a message.

  • Configure the Amazon S3 event to push a message into an Amazon SNS topic
  • Add the instance's URL as an HTTP subscription to the SNS topic
  • When a message is sent to SNS, it forwards it to the instance's URL
  • Code in the web server then triggers your script (see the sketch after this list)
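
A minimal sketch of such a listener using only the Python standard library, assuming SNS can reach the instance on port 8080 (the port and the processing step are placeholders); note that the endpoint has to confirm the SNS subscription the first time it is called:

    import json
    import urllib.request
    from http.server import BaseHTTPRequestHandler, HTTPServer

    class SnsHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            body = self.rfile.read(int(self.headers['Content-Length']))
            message = json.loads(body)

            if message.get('Type') == 'SubscriptionConfirmation':
                # Confirm the SNS subscription by visiting the SubscribeURL once
                urllib.request.urlopen(message['SubscribeURL'])
            elif message.get('Type') == 'Notification':
                # The Message field contains the S3 event notification as JSON
                s3_event = json.loads(message['Message'])
                for record in s3_event.get('Records', []):
                    bucket = record['s3']['bucket']['name']
                    key = record['s3']['object']['key']
                    print('New object:', bucket, key)
                    # ... download the object and run the processing script here ...

            self.send_response(200)
            self.end_headers()

    HTTPServer(('', 8080), SnsHandler).serve_forever()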

This answer is based on the additional information that you wish to shut down the EC2 instance between executions.

I would recommend:

  • Amazon S3 Event triggers Lambda function
  • Lambda function starts the instance, passing filename information via the User Data field (it can be used to pass data, not just scripts). The Lambda function can then immediately exit, which is more cost-effective than waiting for the job to complete (see the sketch after this list)
  • Put your processing script in the /var/lib/cloud/scripts/per-boot/ directory, which will cause it to run every time the instance is started (every time, not just the first time)
  • The script can read the User Data passed from the Lambda function by calling curl http://169.254.169.254/latest/user-data/, so that it knows the filename from S3
  • The script then processes the file
  • The script then runs sudo shutdown now -h to stop the instance
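
A rough sketch of the Lambda side of this flow, assuming the instance is stopped when the event arrives (User Data can only be modified while an instance is stopped) and that the bucket and key are passed as a small JSON document; the region and instance ID are placeholders:

    import json
    import boto3

    ec2 = boto3.client('ec2', region_name='eu-west-1')   # placeholder region
    INSTANCE_ID = 'i-058456c79fjcde676'                  # placeholder instance ID

    def lambda_handler(event, context):
        record = event['Records'][0]
        payload = {
            'bucket': record['s3']['bucket']['name'],
            'key': record['s3']['object']['key'],
        }

        # User Data can only be changed while the instance is stopped;
        # boto3 base64-encodes the value automatically
        ec2.modify_instance_attribute(
            InstanceId=INSTANCE_ID,
            UserData={'Value': json.dumps(payload).encode()},
        )
        ec2.start_instances(InstanceIds=[INSTANCE_ID])

        # Exit immediately; the per-boot script on the instance does the rest
        return payload

On the instance, the per-boot script can read the same JSON back from http://169.254.169.254/latest/user-data/ and decide which file to fetch before shutting down.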

If there is a chance that another file might arrive while the instance is already processing a file, then I would slightly change the process:

  • Rather than passing the filename via User Data, put it into an Amazon SQS queue
  • When the instance is started, it should retrieve the details from the SQS queue
  • After the file is processed, it should check the queue again to see if another message has been sent
    • If yes, process the file and repeat
    • If no, shut itself down (see the sketch below)
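
A minimal sketch of that instance-side loop, assuming the Lambda function put a small JSON message with the bucket and key into the queue (the queue name, region and processing step are placeholders):

    import json
    import subprocess
    import boto3

    sqs = boto3.client('sqs', region_name='eu-west-1')                       # placeholder region
    queue_url = sqs.get_queue_url(QueueName='files-to-process')['QueueUrl']  # placeholder queue

    while True:
        resp = sqs.receive_message(QueueUrl=queue_url,
                                   MaxNumberOfMessages=1,
                                   WaitTimeSeconds=10)
        messages = resp.get('Messages', [])
        if not messages:
            # Queue is empty: nothing left to do, so stop the instance
            subprocess.run(['sudo', 'shutdown', 'now', '-h'])
            break

        msg = messages[0]
        details = json.loads(msg['Body'])   # e.g. {"bucket": "...", "key": "..."}
        # ... download details['bucket'] / details['key'] and run the processing script ...

        # Delete the message only after the file has been processed successfully
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg['ReceiptHandle'])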

By the way, things can sometimes go wrong, so it's worth putting a 'circuit breaker' in the script so that it does not shut down the instance if you want to debug things. This could be a matter of passing a flag, or even adding a tag to the instance, which is checked before calling the shutdown command.
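
For example, the shutdown step could first look for a 'debug' tag on the instance and skip the shutdown if it is present; this is only a sketch and the tag name is arbitrary:

    import subprocess
    import urllib.request
    import boto3

    # Discover this instance's ID from the metadata service
    # (assumes IMDSv1 is enabled; IMDSv2 would need a session token)
    instance_id = urllib.request.urlopen(
        'http://169.254.169.254/latest/meta-data/instance-id').read().decode()

    ec2 = boto3.client('ec2', region_name='eu-west-1')   # placeholder region
    tags = ec2.describe_tags(Filters=[
        {'Name': 'resource-id', 'Values': [instance_id]},
        {'Name': 'key', 'Values': ['debug']},             # arbitrary 'circuit breaker' tag
    ])

    if tags['Tags']:
        print('debug tag present, leaving the instance running')
    else:
        subprocess.run(['sudo', 'shutdown', 'now', '-h'])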
