I have an issue while uploading a large file (larger than 5GB) from HDFS to S3. Is there a way to upload the file directly from HDFS to S3 without downloading it to the local file system and using multipart ?
For copying data between HDFS and S3, you should use s3DistCp. s3DistCp is optimized for AWS and performs an efficient parallel copy of large numbers of files to and from S3 buckets.

For usage of s3DistCp, refer to the documentation here: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html

The code for s3DistCp is available here: https://github.com/libin/s3distcp
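As a rough sketch, a typical invocation on an EMR cluster looks like this (the HDFS path and bucket name below are placeholders, not from the question):

```shell
# Run on the EMR master node; s3-dist-cp ships with EMR.
# --src / --dest take the source and destination URIs;
# the copy runs as a MapReduce job, so large files are split and
# uploaded to S3 via multipart automatically.
s3-dist-cp \
  --src hdfs:///user/hadoop/big.dat \
  --dest s3://example-bucket/big.dat
```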
If you are using Hadoop 2.7.1 or later, use the s3a:// filesystem to talk to S3. It supports multi-part uploads, which is what you need here.
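For example, a plain `hadoop distcp` to an `s3a://` destination will use multipart uploads for a >5GB file. This is a sketch, assuming a Hadoop 2.7.1+ cluster with AWS credentials already configured; the paths and bucket are placeholders:

```shell
# fs.s3a.multipart.size sets the size of each uploaded part, in bytes
# (128 MB here); s3a switches to multipart once the file exceeds
# fs.s3a.multipart.threshold, so no local download is needed.
hadoop distcp \
  -Dfs.s3a.multipart.size=134217728 \
  hdfs:///user/hadoop/big.dat \
  s3a://example-bucket/big.dat
```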
Update: September 2016
I should add that we are reworking the S3A output stream for Hadoop 2.8; the current one buffers multipart uploads on the heap, and falls over when you generate bulk data faster than your network can push it to S3.
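Once on Hadoop 2.8+, the reworked output stream can be told to buffer blocks on local disk instead of the heap. A minimal `core-site.xml` fragment, assuming Hadoop 2.8 or later:

```xml
<!-- Buffer multipart upload blocks on local disk rather than the JVM heap,
     so generating data faster than the network can upload does not OOM. -->
<property>
  <name>fs.s3a.fast.upload</name>
  <value>true</value>
</property>
<property>
  <name>fs.s3a.fast.upload.buffer</name>
  <value>disk</value>
</property>
```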