
Copy and extract files from s3 to HDFS

I want to copy a test.tar.gz file from S3 to HDFS. This can be done with distcp or s3distcp. But my requirement is that while the file is being transferred to HDFS it should be extracted on the fly, so that in HDFS I end up with only the extracted files, not the tar.gz.

Any suggestions, please.

When you transfer over the network, it's usually best that the files remain compressed. Imagine transferring 100GB of uncompressed data instead of a 20GB bz2-compressed file. I would suggest using Hadoop-API-based code or a MapReduce program to extract your compressed files once the transfer to HDFS is done. Once in HDFS, you have all the power to extract the files without copying them over to the local file system.

  1. One solution would be to use simple Hadoop-API-based code or a MapReduce code (updated) that decompresses in parallel; a rough sketch of the Hadoop-API approach follows this list.

    Addendum: For ZIP you can follow this link. And you can come up with something similar for tar.gz.

  2. In case your file is a huge 100GB.zip, you can probably use a Hadoop-API-based program that reads a stream of the Zip archive, extracts it (check the link above for how it was done in the ZipFileRecordReader in the addendum), and then writes it back to HDFS. I think a single ZIP file is not splittable and extractable in parallel (if I'm not mistaken), so if you have a single 100GB zip archive, you'll probably not be able to unleash the full potential of a MapReduce program anyway. Hence, there is no point using it.

  3. Another solution is to not decompress at all. For various built-in compressed formats, Hadoop has a command line utility that lets you view the compressed files as-is, if your intention is to keep them compressed in HDFS:

    hadoop fs -text /path/fileinHDFS.bz2" hadoop fs -text /path/fileinHDFS.bz2“

What's the problem with using a bash script? I mean:

s3distcp --src [file-location] --dst . #Without the hdfs prefix, i.e. into the local working directory
tar -zxvf test.tar.gz                  #Extract locally
hadoop fs -mkdir /input
hadoop fs -mkdir /input/test
hadoop fs -copyFromLocal test/ /input/test   #Upload the extracted files to HDFS

You should be able to achieve this with some clever piping...

Something like this (totally untested):

s3cmd get [s3 path] - | tar -zxOf - | hadoop dfs -put - [hadoop path]

s3cmd get [s3 path] - gets the file from S3 and pipes it to stdout ( - ). tar -zxOf - reads the piped archive from stdin (the -f - part) and extracts its contents to stdout (the -O option). hadoop dfs -put - [hadoop path] puts the piped-in data coming from stdin ( - ) into the given HDFS file.
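
One caveat with the -O option: it concatenates the contents of every entry into a single stream, so you end up with one HDFS file rather than the individual extracted files the question asks for. If you need separate files, a possible variation (again untested, with made-up class and path names, and assuming Commons Compress is available) is to replace the hadoop dfs -put stage with a small Hadoop-API program that reads the piped archive from stdin and writes each tar entry to its own path under a target HDFS directory:

import java.io.OutputStream;
import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class StdinTarGzToHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path outDir = new Path(args[0]);                     // target HDFS directory, e.g. /input/test

        // Read the gzipped tar that s3cmd pipes into us on stdin.
        try (TarArchiveInputStream tar =
                 new TarArchiveInputStream(new GzipCompressorInputStream(System.in))) {
            TarArchiveEntry entry;
            while ((entry = tar.getNextTarEntry()) != null) {
                if (entry.isDirectory()) continue;           // only write regular files
                try (OutputStream out = fs.create(new Path(outDir, entry.getName()))) {
                    IOUtils.copyBytes(tar, out, conf, false); // false: keep the tar stream open
                }
            }
        }
    }
}

Usage would then look something like the following, where the jar name is whatever you package the class into:

s3cmd get [s3 path] - | hadoop jar yourjar.jar StdinTarGzToHdfs /input/test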
