
Copy a directory with content from HDFS to local filesystem

I'm looking for a best way to copy whole directory from HDFS with all contents inside. 我正在寻找一种从HDFS复制整个目录并包含所有内容的最佳方法。 Something like: 就像是:

// Assuming fs is a FileSystem handle initialized against HDFS;
// the actual API method is copyToLocalFile(delSrc, src, dst)
Path srcPath = new Path("hdfs://localhost:9000/user/britva/data");
Path dstPath = new Path("/home/britva/Work");
fs.copyToLocalFile(false, srcPath, dstPath);

Additionally, the "data" folder can contain folders which aren't present in the "Work" directory. So what is the best way of doing this?

Thanks for your answers!

I suppose one of the solutions is to use the FileUtil object, but I'm not sure how to use it, as I have initialized only one FileSystem, for HDFS. The question then is how I should initialize my local FS. As I understand it, this util is used when you have many nodes. But what I want is to work with the local FS, to copy from HDFS to the project sources. A sketch of that idea is shown below.
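For what it's worth, here is a minimal, untested sketch of that idea, reusing the paths from the snippet above: FileSystem.getLocal(conf) gives you a handle on the local filesystem, and FileUtil.copy can copy a whole directory between two filesystems.

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

val conf = new Configuration
// One handle for HDFS, one for the local filesystem
val hdfs  = FileSystem.get(new URI("hdfs://localhost:9000"), conf)
val local = FileSystem.getLocal(conf)

// Recursively copies the whole directory; false = keep the source
FileUtil.copy(hdfs,  new Path("hdfs://localhost:9000/user/britva/data"),
              local, new Path("/home/britva/Work"),
              false, conf)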

Also, as I'm using the Play! framework, it would be great to use its path, like Play.application.path + "/public/stuff".

And when I try to use the code above, it says:

java.io.IOException: No FileSystem for scheme: file

I use Scala, so here is a Scala example, which is similar to the Java one.

Step 1. Make sure your HDFS is active. For a local setup, just try to open 127.0.0.1:50070

Step 2. Here is the Scala code:

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val hdfsConfig = new Configuration
val hdfsURI = "hdfs://127.0.0.1:9000"
val hdfs = FileSystem.get(new URI(hdfsURI), hdfsConfig)

// Remove the target directory in HDFS if it already exists
val targetPath = new Path(hdfsURI + "/hdfsData")
if (hdfs.exists(targetPath)) {
  hdfs.delete(targetPath, true)
}

// Copy the local directory into HDFS (fill in your local path, see Step 3)
val oriPath = new Path("#your_local_file_path")
hdfs.copyFromLocalFile(oriPath, new Path(hdfsURI + "/"))
hdfs.close()

Step 3. For example, my local file path is /tmp/hdfsData

I want to copy all files under this directory; after running Step 2's code, all files will be in HDFS under "127.0.0.1:9000/hdfsData/".

Step 4. For copying from HDFS to local, just change "copyFromLocalFile" to "copyToLocalFile". A sketch follows.
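For example, a minimal sketch of the reverse direction, assuming the hdfs handle and hdfsURI from Step 2 (the local destination path here is hypothetical):

// Copy the directory back from HDFS to the local filesystem;
// the first argument (false) means "do not delete the source"
val src = new Path(hdfsURI + "/hdfsData")
val dst = new Path("/tmp/hdfsDataCopy") // hypothetical local target
hdfs.copyToLocalFile(false, src, dst)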

Regarding the "No FileSystem for scheme" exception when you build your project with Maven: I had an issue like this, and my case was the following.

Please check the content of the JAR you're trying to run, especially the META-INF/services directory and the file org.apache.hadoop.fs.FileSystem in it. It should contain a list of filesystem implementation classes. Check that the line org.apache.hadoop.hdfs.DistributedFileSystem is present in the list for HDFS, and org.apache.hadoop.fs.LocalFileSystem for the local file scheme.

If a line is missing, you have to override the referred resource during the build, for example by merging the service files when the JAR is assembled.
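As an alternative to fixing the build, a common workaround is to pin the filesystem implementations explicitly on the Configuration before calling FileSystem.get. A sketch, assuming the hdfsConfig object from the Scala example above:

// Register the implementations directly so the ServiceLoader lookup
// in META-INF/services is bypassed; the classes come from
// hadoop-hdfs and hadoop-common respectively
hdfsConfig.set("fs.hdfs.impl", classOf[org.apache.hadoop.hdfs.DistributedFileSystem].getName)
hdfsConfig.set("fs.file.impl", classOf[org.apache.hadoop.fs.LocalFileSystem].getName)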

The other possibility is that you simply don't have hadoop-hdfs.jar on your classpath, but that has low probability. Usually, if you have the correct hadoop-client dependency, it is not the case.
