
Transfer files between S3 and EC2 using AWS Data Pipeline

I want to transfer TBs of data from S3 to an EC2 Windows Server instance and then back again, which takes a couple of hours using a basic AWS CLI copy command. To help speed things up I want to use AWS Data Pipeline, and the graphic in the AWS Data Pipeline documentation seems to suggest that data can at least flow from EC2 to S3:

Yet I am finding it hard to understand how that can be done. The closest example I have seen is the concept of staging data via the ShellCommandActivity, which transfers data from an S3 DataNode to an EC2 instance before copying it back to S3.

Instead, I want to copy data from S3 onto an already running Windows instance and then, at a later point, copy further data back into S3.

I believe this can be done without Data Pipeline - mainly to remove some complexity. The job can be done with the AWS CLI, which is also available on Windows XP and greater. If you don't have the AWS CLI on the machine, look for the MSI installer.

On *nix:

aws s3 cp --recursive s3://somebucket ./

This copies the bucket's contents into the current working directory.
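
If raw throughput is the concern, the CLI's S3 transfer settings can also be tuned before running the copy. A minimal sketch, assuming the defaults are the bottleneck (the exact values below are assumptions to adjust for your instance's bandwidth and memory):

# Raise the number of concurrent S3 transfers (default is 10)
aws configure set default.s3.max_concurrent_requests 20

# Use larger multipart chunks for very large objects (default is 8MB)
aws configure set default.s3.multipart_chunksize 64MB

# Then run the recursive copy as above
aws s3 cp --recursive s3://somebucket ./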

In short, I don't think you would get any performance benefit from using AWS Data Pipeline for this use case.

The reason is that Task Runner (the executor used by Data Pipeline) is not supported on the Windows platform. So any activities you try to run would actually run on a different platform, and you would then scp/sftp the data to your machine.

There are different ways to pull data into an EC2 instance on other platforms:

  1. Use ShellCommandActivity: it allows you not only to upload to S3 but also to download from it, using environment variables such as INPUT1_STAGING_DIR. Though the docs do not describe the implementation or any performance improvement, I believe it does a parallel pull using S3's multipart APIs.
  2. Use EMR + s3distcp to get the data into HDFS, then get it onto local disk from HDFS using getmerge (see the sketch after this list).
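
For option 2, a rough sketch of the EMR route, run on the cluster's master node (the bucket and path names here are placeholders):

# Parallel copy from S3 into HDFS using S3DistCp
s3-dist-cp --src s3://somebucket/data --dest hdfs:///data

# Merge the HDFS files down into a single file on local disk
hadoop fs -getmerge hdfs:///data /mnt/local-copy/data.out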

Do you need all of this data present on the Windows machine? Or are you going to be accessing it intermittently?

You might try just mounting your S3 bucket.

It will still be remote, but it will act like a normal mounted drive in Windows. If you need to do some data crunching, copy just the files you need at that moment to the local disk. You can mount S3 with S3 Browser, CloudBerry, or a hundred other S3 clients.
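
As one illustration of the mounting approach, rclone (not one of the clients named above, just a common free option) can map a bucket to a drive letter. This sketch assumes WinFsp is installed and uses a hypothetical remote name and drive letter:

# One-time, interactive: define an S3 remote named "s3remote"
rclone config

# Present the bucket as Windows drive X: (requires WinFsp)
rclone mount s3remote:somebucket X: --vfs-cache-mode full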

At last I was able to do the data transfer from EC2 to S3 using Data Pipeline.

Steps:

  1. Firstly, we need Task Runner running on the EC2 machine (a launch sketch follows after the ref below).
  2. aws s3 won't work from your EC2 instance by default, because the instance doesn't have rights to your S3 buckets; use aws configure to add your access key ID and secret access key.
  3. Create a data pipeline, using the architect to add a ShellCommandActivity. Set the script path to your .sh file, which could contain a command like aws s3 cp /var/tmp/xyz s3://abc; and, most importantly, add the worker group that denotes the Task Runner you started on EC2.

    ref: http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-how-task-runner-user-managed.html
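
For step 1, the referenced guide launches a user-managed Task Runner roughly like this (the jar version, credentials file, worker group name, region, and log bucket below are placeholders to match your setup; the worker group must equal the one set on the activity in step 3):

# Launch Task Runner on the EC2 instance and register it under a worker group
java -jar TaskRunner-1.0.jar \
    --config ~/credentials.json \
    --workerGroup=wg-12345 \
    --region=us-east-1 \
    --logUri=s3://abc/logs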
