

Download a large file from HDFS

I was given a DataInputStream from an HDFS client for a large file (around 2 GB), and I need to store it as a file on my host.

I was thinking about using Apache Commons IOUtils and doing something like this...

File temp = getTempFile(localPath);
try (DataInputStream dis = HDFSClient.open(filepath); // around 2GB file (zipped)
     InputStream in = new BufferedInputStream(dis);
     OutputStream out = new FileOutputStream(temp)) {
    IOUtils.copy(in, out);
}

I was looking for other solutions that might work better than this approach. My major concern is that buffering happens twice: once in the BufferedInputStream and again inside IOUtils.copy...

For files larger than 2 GB it is recommended to use IOUtils.copyLarge() (if we are speaking about the same IOUtils: org.apache.commons.io.IOUtils).
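As a side note, if adding commons-io is not an option, the standard library has offered an equivalent long-returning copy since Java 9 via InputStream.transferTo. A minimal runnable sketch (an in-memory stream stands in for the HDFS one here):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class TransferDemo {
    public static void main(String[] args) throws IOException {
        byte[] data = new byte[1 << 20]; // 1 MB stand-in for the HDFS stream
        try (InputStream in = new ByteArrayInputStream(data);
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            // transferTo returns a long byte count, like copyLarge()
            long copied = in.transferTo(out);
            System.out.println(copied); // 1048576
        }
    }
}
```

In the real download you would pass the DataInputStream from HDFSClient.open and a FileOutputStream for the temp file instead of the in-memory streams.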

The copy in IOUtils uses a default buffer size of 4 KB (although you can specify another buffer size as a parameter).

The difference between copy() and copyLarge() is the returned result.

For copy(), if the stream is bigger than 2 GB the copy will succeed, but the result is -1.

For copyLarge(), the result is exactly the number of bytes you copied.
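To see why the return types differ, here is a rough sketch of the buffered read loop that copyLarge() performs (the helper below is an illustration, not the actual commons-io source). The count accumulates in a long, so it cannot overflow past 2 GB the way an int-returning copy() would:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class CopyLoop {
    // Illustrative equivalent of copyLarge(): read into a buffer until EOF,
    // accumulating the total in a long instead of an int.
    static long copyLarge(InputStream in, OutputStream out, int bufSize) throws IOException {
        byte[] buf = new byte[bufSize];
        long count = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
            count += n;
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long copied = copyLarge(new ByteArrayInputStream(new byte[10_000]), out, 4096);
        System.out.println(copied); // 10000
    }
}
```

The bufSize parameter plays the same role as the custom buffer you can pass to the commons-io methods.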

See more in the documentation here: http://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/IOUtils.html#copyLarge(java.io.InputStream,%20java.io.OutputStream)

