GZIPInputStream closes prematurely when decompressing HTTPInputStream

Question

See updated question in edit section below

I'm trying to decompress large (~300M) GZIPed files from Amazon S3 on the fly using GZIPInputStream but it only outputs a portion of the file; however, if I download to the filesystem before decompression then GZIPInputStream will decompress the entire file.

How can I get GZIPInputStream to decompress the entire HTTPInputStream and not just the first part of it?

What I've Tried

See the update in the edit section below.

I suspected an HTTP problem, except that no exceptions are ever thrown, GZIPInputStream returns a fairly consistent chunk of the file each time, and, as far as I can tell, it always breaks on a WET record boundary, although the boundary it picks is different for each URL (which is very strange, as everything is being treated as a binary stream; no parsing of the WET records in the file is happening at all).

The closest question I could find is GZIPInputStream is prematurely closed when reading from s3. The answer to that question was that some GZIP files are actually multiple appended GZIP files, and GZIPInputStream doesn't handle that well. However, if that is the case here, why would GZIPInputStream work fine on a local copy of the file?

Demonstration Code and Output

Below is a piece of sample code that demonstrates the problem I am seeing. I've tested it with Java 1.8.0_72 and 1.8.0_112 on two different Linux computers on two different networks with similar results. I expect the byte count from the decompressed HTTPInputStream to be identical to the byte count from the decompressed local copy of the file, but the decompressed HTTPInputStream is much smaller.

Output
Testing URL https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00009-ip-10-31-129-80.ec2.internal.warc.wet.gz
Testing HTTP Input Stream direct to GZIPInputStream
Testing saving to file before decompression
Read 87894 bytes from HTTP->GZIP
Read 448974935 bytes from HTTP->file->GZIP
Output from HTTP->GZIP saved to file testfile0.wet
------
Testing URL https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00040-ip-10-31-129-80.ec2.internal.warc.wet.gz
Testing HTTP Input Stream direct to GZIPInputStream
Testing saving to file before decompression
Read 1772936 bytes from HTTP->GZIP
Read 451171329 bytes from HTTP->file->GZIP
Output from HTTP->GZIP saved to file testfile40.wet
------
Testing URL https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698541142.66/wet/CC-MAIN-20161202170901-00500-ip-10-31-129-80.ec2.internal.warc.wet.gz
Testing HTTP Input Stream direct to GZIPInputStream
Testing saving to file before decompression
Read 89217 bytes from HTTP->GZIP
Read 453183600 bytes from HTTP->file->GZIP
Output from HTTP->GZIP saved to file testfile500.wet
Sample Code
import java.net.*;
import java.io.*;
import java.util.zip.GZIPInputStream;
import java.nio.channels.*;

public class GZIPTest {

    public static void main(String[] args) throws Exception {
        // Our three test files from CommonCrawl
        URL url0 = new URL("https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00009-ip-10-31-129-80.ec2.internal.warc.wet.gz");
        URL url40 = new URL("https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00040-ip-10-31-129-80.ec2.internal.warc.wet.gz");
        URL url500 = new URL("https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698541142.66/wet/CC-MAIN-20161202170901-00500-ip-10-31-129-80.ec2.internal.warc.wet.gz");

        /*
         * Test the URLs and display the results
         */
        test(url0, "testfile0.wet");
        System.out.println("------");
        test(url40, "testfile40.wet");
        System.out.println("------");
        test(url500, "testfile500.wet");
    }

    public static void test(URL url, String testGZFileName) throws Exception {
        System.out.println("Testing URL "+url.toString());

        // First directly wrap the HTTPInputStream with GZIPInputStream
        // and count the number of bytes we read
        // Go ahead and save the extracted stream to a file for further inspection
        System.out.println("Testing HTTP Input Stream direct to GZIPInputStream");
        int bytesFromGZIPDirect = 0;
        URLConnection urlConnection = url.openConnection();
        FileOutputStream directGZIPOutStream = new FileOutputStream("./"+testGZFileName);

        // FIRST TEST - Decompress from HTTPInputStream
        GZIPInputStream gzipishttp = new GZIPInputStream(urlConnection.getInputStream());

        byte[] buffer = new byte[1024];
        int bytesRead = -1;
        while ((bytesRead = gzipishttp.read(buffer, 0, 1024)) != -1) {
            bytesFromGZIPDirect += bytesRead;
            directGZIPOutStream.write(buffer, 0, bytesRead); // save to file for further inspection
        }
        gzipishttp.close();
        directGZIPOutStream.close();

        // Now save the GZIPed file locally
        System.out.println("Testing saving to file before decompression");
        int bytesFromGZIPFile = 0;
        ReadableByteChannel rbc = Channels.newChannel(url.openStream());
        FileOutputStream outputStream = new FileOutputStream("./test.wet.gz");
        outputStream.getChannel().transferFrom(rbc, 0, Long.MAX_VALUE);
        outputStream.close();

        // SECOND TEST - decompress from FileInputStream
        GZIPInputStream gzipis = new GZIPInputStream(new FileInputStream("./test.wet.gz"));
        buffer = new byte[1024];
        bytesRead = -1;
        while((bytesRead = gzipis.read(buffer, 0, 1024)) != -1) {
            bytesFromGZIPFile += bytesRead;
        }
        gzipis.close();

        // The Results - these numbers should match but they don't
        System.out.println("Read "+bytesFromGZIPDirect+" bytes from HTTP->GZIP");
        System.out.println("Read "+bytesFromGZIPFile+" bytes from HTTP->file->GZIP");
        System.out.println("Output from HTTP->GZIP saved to file "+testGZFileName);
    }
}

Edit

Closed the Stream and associated Channel in the demonstration code, as per the comment by @VGR.

UPDATE:

The problem does seem to be something specific to the file. I pulled the Common Crawl WET archive down locally (wget), uncompressed it (gunzip 1.8), recompressed it (gzip 1.8), re-uploaded it to S3, and the on-the-fly decompression then worked fine. You can see the test if you modify the sample code above to include the following lines:

// Original file from CommonCrawl hosted on S3
URL originals3 = new URL("https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00009-ip-10-31-129-80.ec2.internal.warc.wet.gz");

// Recompressed file hosted on S3
URL rezippeds3 = new URL("https://s3-us-west-1.amazonaws.com/com.jeffharwell.commoncrawl.gziptestbucket/CC-MAIN-20161202170900-00009-ip-10-31-129-80.ec2.internal.warc.wet.gz");

test(originals3, "originalhost.txt");
test(rezippeds3, "rezippedhost.txt");

URL rezippeds3 points to the WET archive file that I downloaded, decompressed and recompressed, then re-uploaded to S3. You will see the following output:

Testing URL https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00009-ip-10-31-129-80.ec2.internal.warc.wet.gz
Testing HTTP Input Stream direct to GZIPInputStream
Testing saving to file before decompression
Read 7212400 bytes from HTTP->GZIP
Read 448974935 bytes from HTTP->file->GZIP
Output from HTTP->GZIP saved to file originals3.txt
-----
Testing URL https://s3-us-west-1.amazonaws.com/com.jeffharwell.commoncrawl.gziptestbucket/CC-MAIN-20161202170900-00009-ip-10-31-129-80.ec2.internal.warc.wet.gz
Testing HTTP Input Stream direct to GZIPInputStream
Testing saving to file before decompression
Read 448974935 bytes from HTTP->GZIP
Read 448974935 bytes from HTTP->file->GZIP
Output from HTTP->GZIP saved to file rezippeds3.txt

As you can see, once the file was recompressed I was able to stream it through GZIPInputStream and get the entire file. The original file still shows the usual premature end of decompression. When I downloaded and uploaded the WET file without recompressing it, I got the same incomplete streaming behavior, so it was definitely the recompression that fixed it. I also put both files, the original and the recompressed, onto a traditional Apache web server and was able to replicate the results, so S3 doesn't seem to have anything to do with the problem.

So. I have a new question.

New Question

Why would a FileInputStream behave differently than an HTTPInputStream when reading the same content? If it is the exact same file, why does:

new GZIPInputStream(urlConnection.getInputStream());

behave any differently than

new GZIPInputStream(new FileInputStream("./test.wet.gz"));

?

Isn't an input stream just an input stream?

Root Cause Discussion

It turns out that InputStreams can vary quite a bit. In particular, they differ in how they implement the .available() method. For example, ByteArrayInputStream.available() returns the number of bytes remaining in the InputStream. However, HTTPInputStream.available() returns the number of bytes that can be read before a blocking IO request needs to be made to refill the buffer. (See the Java docs for more information.)
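To make that difference concrete, here is a minimal sketch (the class name and byte counts are mine, purely for illustration). The HTTP part is left as a comment because what a network stream has buffered at any given moment is timing-dependent:

import java.io.ByteArrayInputStream;
import java.io.InputStream;

public class AvailableContrast {
    public static void main(String[] args) throws Exception {
        // ByteArrayInputStream.available() reports exactly how many
        // bytes are left in the backing array.
        InputStream bais = new ByteArrayInputStream(new byte[100]);
        bais.read(new byte[40]);
        System.out.println(bais.available()); // prints 60

        // A network-backed stream reports only what is already sitting
        // in its local buffer, which can legitimately be 0 in the middle
        // of a response. Timing-dependent, so shown here as a comment:
        // InputStream http = new java.net.URL("https://example.com/").openStream();
        // System.out.println(http.available()); // often 0 before any read
    }
}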

The problem is that GZIPInputStream uses the output of .available() to determine whether there might be an additional GZIP file available in the InputStream after it finishes decompressing a complete GZIP file. Here is line 231 from the OpenJDK source file GZIPInputStream.java, method readTrailer():

   if (this.in.available() > 0 || n > 26) {

If the HTTPInputStream read buffer empties right at the boundary between two concatenated GZIP files, GZIPInputStream calls .available(), which responds with 0 because it would need to go out to the network to refill the buffer, so GZIPInputStream treats the file as complete and closes prematurely.

The Common Crawl .wet archives are hundreds of megabytes of small concatenated GZIP files, so eventually the HTTPInputStream buffer will empty right at the end of one of the concatenated GZIP files and GZIPInputStream will close prematurely. This explains the problem demonstrated in the question.
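The failure can be reproduced without any network at all. The following sketch is my own construction (BoundaryStream and the other names are illustrative, not part of any library): it appends two small GZIP members, then serves them through an InputStream whose read buffer runs dry exactly at the member boundary and whose available() always returns 0, which is precisely the HTTP situation described above. GZIPInputStream stops silently after the first member.

import java.io.*;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class PrematureCloseDemo {
    // Serves a byte array but never lets a single read() cross 'boundary',
    // and always reports available() == 0 -- just like an HTTP stream whose
    // buffer happens to run dry at the end of a GZIP member.
    static class BoundaryStream extends InputStream {
        private final byte[] data;
        private final int boundary;
        private int pos = 0;

        BoundaryStream(byte[] data, int boundary) {
            this.data = data;
            this.boundary = boundary;
        }

        @Override
        public int read() {
            return pos < data.length ? (data[pos++] & 0xff) : -1;
        }

        @Override
        public int read(byte[] b, int off, int len) {
            if (pos >= data.length) return -1;
            int limit = (pos < boundary) ? boundary : data.length;
            int n = Math.min(len, limit - pos);
            System.arraycopy(data, pos, b, off, n);
            pos += n;
            return n;
        }

        @Override
        public int available() {
            return 0; // refilling would require a blocking read
        }
    }

    // Compress a string into a single GZIP member
    static byte[] gzip(String s) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(s.getBytes("UTF-8"));
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        // Two GZIP members appended back to back, like a WET archive
        byte[] first = gzip("first member ");
        byte[] second = gzip("second member");
        byte[] archive = new byte[first.length + second.length];
        System.arraycopy(first, 0, archive, 0, first.length);
        System.arraycopy(second, 0, archive, first.length, second.length);

        // The read buffer empties exactly at the end of the first member
        GZIPInputStream in = new GZIPInputStream(
                new BoundaryStream(archive, first.length));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        // Prints only "first member " -- the second member is never seen
        System.out.println("Decompressed: '" + out.toString("UTF-8") + "'");
    }
}

Serve the same archive from a plain ByteArrayInputStream instead, whose available() reports the bytes remaining, and both members are decompressed, which is exactly why the local-file test in the question worked.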

Solution and Workaround

This GIST contains a patch to jdk8u152-b00 revision 12039 and two jtreg tests that remove the (in my humble opinion) incorrect reliance on .available().

If you cannot patch the JDK, a workaround is to make sure that available() always returns > 0, which forces GZIPInputStream to always check for another GZIP file in the stream. Unfortunately HTTPInputStream is private, so you cannot subclass it directly; instead, extend InputStream and wrap the HTTPInputStream. The code below demonstrates this workaround.

Demonstration Code and Output

Here is the output showing that when the HTTPInputStream is wrapped as discussed, GZIPInputStream produces identical results when reading the concatenated GZIP from a file and directly from HTTP.

Testing URL https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00009-ip-10-31-129-80.ec2.internal.warc.wet.gz
Testing HTTP Input Stream direct to GZIPInputStream
Testing saving to file before decompression
Read 448974935 bytes from HTTP->GZIP
Read 448974935 bytes from HTTP->file->GZIP
Output from HTTP->GZIP saved to file testfile0.wet
------
Testing URL https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00040-ip-10-31-129-80.ec2.internal.warc.wet.gz
Testing HTTP Input Stream direct to GZIPInputStream
Testing saving to file before decompression
Read 451171329 bytes from HTTP->GZIP
Read 451171329 bytes from HTTP->file->GZIP
Output from HTTP->GZIP saved to file testfile40.wet
------
Testing URL https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698541142.66/wet/CC-MAIN-20161202170901-00500-ip-10-31-129-80.ec2.internal.warc.wet.gz
Testing HTTP Input Stream direct to GZIPInputStream
Testing saving to file before decompression
Read 453183600 bytes from HTTP->GZIP
Read 453183600 bytes from HTTP->file->GZIP
Output from HTTP->GZIP saved to file testfile500.wet

Here is the demonstration code from the question modified with an InputStream wrapper.

import java.net.*;
import java.io.*;
import java.util.zip.GZIPInputStream;
import java.nio.channels.*;

public class GZIPTest {
    // Here is a wrapper class that wraps an InputStream
    // but always returns > 0 when .available() is called.
    // This will cause GZIPInputStream to always make another 
    // call to the InputStream to check for an additional 
    // concatenated GZIP file in the stream.
    public static class AvailableInputStream extends InputStream {
        private InputStream is;

        AvailableInputStream(InputStream inputstream) {
            is = inputstream;
        }

        public int read() throws IOException {
            return(is.read());
        }

        public int read(byte[] b) throws IOException {
            return(is.read(b));
        }

        public int read(byte[] b, int off, int len) throws IOException {
            return(is.read(b, off, len));
        }

        public void close() throws IOException {
            is.close();
        }

        public int available() throws IOException {
            // Always say that we have 1 more byte in the
            // buffer, even when we don't
            int a = is.available();
            if (a == 0) {
                return(1);
            } else {
                return(a);
            }
        }
    }



    public static void main(String[] args) throws Exception {
        // Our three test files from CommonCrawl
        URL url0 = new URL("https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00009-ip-10-31-129-80.ec2.internal.warc.wet.gz");
        URL url40 = new URL("https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00040-ip-10-31-129-80.ec2.internal.warc.wet.gz");
        URL url500 = new URL("https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698541142.66/wet/CC-MAIN-20161202170901-00500-ip-10-31-129-80.ec2.internal.warc.wet.gz");

        /*
         * Test the URLs and display the results
         */
        test(url0, "testfile0.wet");
        System.out.println("------");
        test(url40, "testfile40.wet");
        System.out.println("------");
        test(url500, "testfile500.wet");
    }

    public static void test(URL url, String testGZFileName) throws Exception {
        System.out.println("Testing URL "+url.toString());

        // First directly wrap the HTTP inputStream with GZIPInputStream
        // and count the number of bytes we read
        // Go ahead and save the extracted stream to a file for further inspection
        System.out.println("Testing HTTP Input Stream direct to GZIPInputStream");
        int bytesFromGZIPDirect = 0;
        URLConnection urlConnection = url.openConnection();
        // Wrap the HTTPInputStream in our AvailableInputStream
        AvailableInputStream ais = new AvailableInputStream(urlConnection.getInputStream());
        GZIPInputStream gzipishttp = new GZIPInputStream(ais);
        FileOutputStream directGZIPOutStream = new FileOutputStream("./"+testGZFileName);
        int buffersize = 1024;
        byte[] buffer = new byte[buffersize];
        int bytesRead = -1;
        while ((bytesRead = gzipishttp.read(buffer, 0, buffersize)) != -1) {
            bytesFromGZIPDirect += bytesRead;
            directGZIPOutStream.write(buffer, 0, bytesRead); // save to file for further inspection
        }
        gzipishttp.close();
        directGZIPOutStream.close();

        // Save the GZIPed file locally
        System.out.println("Testing saving to file before decompression");
        ReadableByteChannel rbc = Channels.newChannel(url.openStream());
        FileOutputStream outputStream = new FileOutputStream("./test.wet.gz");
        outputStream.getChannel().transferFrom(rbc, 0, Long.MAX_VALUE);
        outputStream.close();
        rbc.close();

        // Now decompress the local file and count the number of bytes
        int bytesFromGZIPFile = 0;
        GZIPInputStream gzipis = new GZIPInputStream(new FileInputStream("./test.wet.gz"));

        buffer = new byte[1024];
        while((bytesRead = gzipis.read(buffer, 0, 1024)) != -1) {
            bytesFromGZIPFile += bytesRead;
        }
        gzipis.close();

        // The Results
        System.out.println("Read "+bytesFromGZIPDirect+" bytes from HTTP->GZIP");
        System.out.println("Read "+bytesFromGZIPFile+" bytes from HTTP->file->GZIP");
        System.out.println("Output from HTTP->GZIP saved to file "+testGZFileName);
    }

}
