
How to improve java.util.zip.GZIPInputStream performance to unzip a large .gz file?

I'm trying to unzip a very large .gz file (around 50 MB) in Java and then transfer it to the Hadoop file system. After unzipping, the file size becomes 20 GB. It takes more than 5 minutes to do this job.

protected void write(BufferedInputStream bis, Path outputPath, FileSystem hdfs) throws IOException
{
        // try-with-resources flushes and closes the HDFS stream when the copy finishes
        try (BufferedOutputStream bos = new BufferedOutputStream(hdfs.create(outputPath))) {
                IOUtils.copyBytes(bis, bos, 8 * 1024);
        }
}
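
The snippet doesn't show how the GZIPInputStream itself is constructed. As a minimal caller sketch (hypothetical; the local path and the upload method are illustrative, not from the original post): GZIPInputStream allocates only a 512-byte internal inflater buffer by default, so passing a larger size there and in the BufferedInputStream reduces the number of small reads.

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

void upload(FileSystem hdfs, Path outputPath) throws IOException
{
        try (FileInputStream fin = new FileInputStream("/data/input.gz");  // illustrative local path
             // default inflater buffer is 512 bytes; 64 KB cuts per-call overhead
             GZIPInputStream gzin = new GZIPInputStream(fin, 64 * 1024);
             BufferedInputStream bis = new BufferedInputStream(gzin, 64 * 1024)) {
                write(bis, outputPath, hdfs);
        }
}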

Even after using Buffered I/O streams, it is taking very long to decompress and transfer the file.

Is Hadoop causing the file transfer to be slow, or is GZIPInputStream slow?

Writing 20 GB will take time. If you do it in 300 seconds, you are still writing roughly 70 MB a second.

You may simply be hitting the limits of the platform.
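
To see which side is the bottleneck, one option (a benchmark sketch, not part of the original answer; the class name is made up) is to time the decompression alone and discard the output. If inflating the 50 MB file to 20 GB already takes most of the 5 minutes, HDFS is not the limiting factor.

import java.io.FileInputStream;
import java.util.zip.GZIPInputStream;

public class InflateBench
{
        public static void main(String[] args) throws Exception
        {
                long start = System.nanoTime();
                long total = 0;
                try (GZIPInputStream gzin = new GZIPInputStream(
                                new FileInputStream(args[0]), 64 * 1024)) {
                        byte[] buf = new byte[64 * 1024];
                        int n;
                        while ((n = gzin.read(buf)) > 0) {
                                total += n;  // bytes are discarded; we only time decompression
                        }
                }
                double secs = (System.nanoTime() - start) / 1e9;
                System.out.printf("inflated %d bytes in %.1f s (%.1f MB/s)%n",
                                total, secs, total / secs / 1e6);
        }
}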

If you rewrite your processing code to read the compressed file directly, that may help.
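
For instance, with Hadoop's standard compression API (a sketch under that assumption; the path is illustrative), the .gz file can be stored in HDFS unchanged and inflated on the fly while reading it back:

import java.io.IOException;
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

static void readCompressed() throws IOException
{
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(conf);
        Path gzPath = new Path("/data/input.gz");  // illustrative HDFS path
        // resolves GzipCodec from the .gz suffix
        CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(gzPath);
        try (InputStream in = codec.createInputStream(hdfs.open(gzPath))) {
                byte[] buf = new byte[64 * 1024];
                for (int n; (n = in.read(buf)) > 0; ) {
                        // process decompressed bytes here instead of writing a 20 GB copy first
                }
        }
}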
