
How to import data from AWS S3 to HDFS with Hadoop MapReduce

I know that Apache Hadoop provides distcp to copy files from AWS S3 to HDFS. But it does not seem very efficient, and its logging is inflexible.

In my project, we need to write a log entry in our own customized format after each file transfer to HDFS succeeds or fails. Given the large amount of data to load, using Hadoop MapReduce to pull the AWS data into the HDFS cluster seems by far the most efficient approach, so I intend to write a Hadoop MapReduce job similar to distcp.

My plan is to have each Mapper on each node load one S3 directory with the AWS Java SDK, since there are many S3 directories to be loaded into HDFS. Could anyone give some suggestions on how to achieve this? Thanks in advance!
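To make the plan concrete, here is a minimal sketch of such a Mapper, not a tested implementation. It assumes AWS SDK for Java v1, that each input line carries one S3 prefix ("directory"), e.g. fed in via NLineInputFormat so each Mapper handles one directory, and that the bucket and HDFS target root come from made-up job properties (s3.import.bucket, s3.import.target).

    import java.io.IOException;
    import java.io.InputStream;

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;
    import com.amazonaws.services.s3.model.ObjectListing;
    import com.amazonaws.services.s3.model.S3ObjectSummary;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Sketch: each map() call receives one S3 prefix, streams every object under it
    // into HDFS with the AWS Java SDK, and emits a per-file status record.
    public class S3ToHdfsMapper extends Mapper<LongWritable, Text, Text, Text> {

        private AmazonS3 s3;
        private FileSystem hdfs;
        private String bucket;      // hypothetical property: -D s3.import.bucket=my-bucket
        private String targetRoot;  // hypothetical property: -D s3.import.target=/data/import

        @Override
        protected void setup(Context context) throws IOException {
            Configuration conf = context.getConfiguration();
            s3 = AmazonS3ClientBuilder.defaultClient();   // region/credentials from the environment
            hdfs = FileSystem.get(conf);
            bucket = conf.get("s3.import.bucket");
            targetRoot = conf.get("s3.import.target", "/data/import");
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String prefix = value.toString().trim();       // one S3 "directory" per input line
            ObjectListing listing = s3.listObjects(bucket, prefix);
            while (true) {
                for (S3ObjectSummary summary : listing.getObjectSummaries()) {
                    String objectKey = summary.getKey();
                    if (objectKey.endsWith("/")) {
                        continue;                          // skip directory marker objects
                    }
                    Path target = new Path(targetRoot, objectKey);
                    try (InputStream in = s3.getObject(bucket, objectKey).getObjectContent();
                         FSDataOutputStream out = hdfs.create(target, true)) {
                        IOUtils.copyBytes(in, out, context.getConfiguration(), false);
                        context.write(new Text(objectKey), new Text("SUCCESS"));
                    } catch (IOException e) {
                        context.write(new Text(objectKey), new Text("FAILED: " + e.getMessage()));
                    }
                }
                if (!listing.isTruncated()) {
                    break;                                 // handle paginated listings
                }
                listing = s3.listNextBatchOfObjects(listing);
            }
        }
    }

The map output (object key, SUCCESS/FAILED) can then serve as the customized per-file transfer log, formatted however you like by a reducer or by post-processing the job output.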

Have you tried s3a? s3a is the successor to the original s3n: it removes some limitations (file size) and improves performance. Also, what exactly is the problem with distcp, and which filesystem are you using for S3 (s3n or s3a)? There has been a fair amount of work done on distcp recently, so it might be worth checking the newest version.
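For reference, a typical DistCp invocation over the s3a connector looks roughly like this (bucket and paths are placeholders); the -log option tells DistCp to write its per-file status records to an HDFS directory, which may already cover part of the logging requirement:

    hadoop distcp \
      -log hdfs:///tmp/distcp-logs \
      s3a://my-bucket/source-dir/ \
      hdfs:///data/target-dir/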
