
How to upload large files from HDFS to S3

I have an issue uploading a large file (larger than 5 GB) from HDFS to S3. Is there a way to upload the file directly from HDFS to S3, using multipart upload, without first downloading it to the local file system?

For copying data between HDFS and S3, you should use s3DistCp. s3DistCp is optimized for AWS and efficiently copies a large number of files in parallel, including across S3 buckets.

For usage of s3DistCp, refer to the documentation here: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html

The code for s3DistCp is available here: https://github.com/libin/s3distcp
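For illustration, a hedged sketch of what an s3DistCp run might look like on an EMR cluster; the bucket name and HDFS paths below are placeholders, not taken from the original question:

    # Run on the EMR master node; copies an HDFS directory to S3 in parallel.
    # hdfs:///data/large-files/ and s3://my-bucket/backup/ are placeholder paths.
    s3-dist-cp \
      --src hdfs:///data/large-files/ \
      --dest s3://my-bucket/backup/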

If you are using Hadoop 2.7.1 or later, use the s3a:// filesystem to talk to S3. It supports multi-part uploads, which is what you need here.
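As a rough sketch of that approach (the bucket name, paths, and credential values are placeholders), a direct HDFS-to-S3 copy over s3a can be driven with hadoop distcp:

    # Copy a large file from HDFS to S3 via the s3a connector (Hadoop 2.7.1+).
    # Credentials can also come from core-site.xml or instance roles;
    # the -D values and s3a://my-bucket/ path are placeholders.
    hadoop distcp \
      -Dfs.s3a.access.key=YOUR_ACCESS_KEY \
      -Dfs.s3a.secret.key=YOUR_SECRET_KEY \
      -Dfs.s3a.multipart.size=104857600 \
      hdfs:///data/large-file \
      s3a://my-bucket/data/

Because s3a issues multipart uploads under the hood, this avoids both the local-disk round trip and the 5 GB single-PUT limit.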

Update: September 2016

I should add that we are reworking the S3A output stream for Hadoop 2.8; the current one buffers multipart uploads in the heap and falls over when you generate bulk data faster than your network can push it to S3.
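Once on Hadoop 2.8, the heap-buffering problem can be sidestepped by telling S3A to buffer upload blocks on local disk instead. A minimal sketch, assuming the Hadoop 2.8 S3A properties fs.s3a.fast.upload and fs.s3a.fast.upload.buffer, with placeholder paths:

    # Hadoop 2.8+: buffer multipart blocks on local disk rather than in the JVM heap.
    # The HDFS path and bucket are placeholders.
    hadoop distcp \
      -Dfs.s3a.fast.upload=true \
      -Dfs.s3a.fast.upload.buffer=disk \
      hdfs:///data/large-file \
      s3a://my-bucket/data/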
