
Copying a file inside the same HDFS using the FileUtil API is taking too much time

I have one HDFS cluster and a local system from which I'm executing my program to perform a copy within that same HDFS system, like: hadoop fs -cp /user/hadoop/SrcFile /user/hadoop/TgtFile

I'm using:

FileUtil.copy(FileSystem srcFS,
              FileStatus srcStatus,
              FileSystem dstFS,
              Path dst,
              boolean deleteSource,
              boolean overwrite,
              Configuration conf)
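
Roughly, the call looks like this (a minimal sketch; the surrounding class, the FileSystem setup, and the overwrite/deleteSource choices are illustrative assumptions, while the paths are the ones from the example above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class HdfsCopy {
    public static void main(String[] args) throws Exception {
        // fs.defaultFS in the loaded configuration must point at the cluster.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path src = new Path("/user/hadoop/SrcFile");
        Path dst = new Path("/user/hadoop/TgtFile");
        FileStatus srcStatus = fs.getFileStatus(src);

        // Source and destination FileSystem are the same here; the bytes
        // are still read into this JVM and written back out.
        FileUtil.copy(fs, srcStatus, fs, dst,
                      false /* deleteSource */, true /* overwrite */, conf);
    }
}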

But something weird is happening: when I do the copy from the command line it only takes a moment, but when I do it programmatically it takes 10-15 minutes to copy a 190 MB file.

To me it looks like the data is being streamed via my local system instead of being copied directly, even though the destination is on the same filesystem as the source.

Correct me if I'm wrong, and please help me find the best solution.

You are right in that with FileUtil.copy the stream is passed through your program (src --> your program --> dst). If Hadoop's filesystem shell (hadoop dfs -cp) is faster, you can invoke the same command through Runtime.exec(cmd), as sketched below.
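
A minimal sketch of that suggestion, assuming the hadoop binary is on the PATH of the machine running the program (the class wrapper and error handling are illustrative, the paths are from the question):

import java.io.IOException;

public class ShellCopy {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Launch the filesystem shell as an external process; it performs
        // the copy exactly as it would when run from the command line.
        Process p = Runtime.getRuntime().exec(
                new String[] {"hadoop", "fs", "-cp",
                              "/user/hadoop/SrcFile", "/user/hadoop/TgtFile"});
        int exitCode = p.waitFor();
        if (exitCode != 0) {
            throw new IOException("hadoop fs -cp failed with exit code " + exitCode);
        }
    }
}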

https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileUtil.java
