
Python gzip and Java GZIPOutputStream give different results

I'm trying to take the hash of a gzipped string in Python and need it to be identical to the one computed in Java. However, Python's gzip implementation seems to produce different output from Java's GZIPOutputStream.

Python gzip:

import gzip
import hashlib

gzip_bytes = gzip.compress(bytes('test', 'utf-8'))
gzip_hex = gzip_bytes.hex().upper()
md5 = hashlib.md5(gzip_bytes).hexdigest().upper()

>>> gzip_hex
'1F8B0800678B186002FF2B492D2E01000C7E7FD804000000'
>>> md5
'C4C763E9A0143D36F52306CF4CCC84B8'

Java GZIPOutputStream:

import java.io.ByteArrayOutputStream;
import java.util.zip.GZIPOutputStream;
import java.io.IOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class HelloWorld{
    private static final char[] HEX_ARRAY = "0123456789ABCDEF".toCharArray();
    public static String bytesToHex(byte[] bytes) {
        char[] hexChars = new char[bytes.length * 2];
        for (int j = 0; j < bytes.length; j++) {
            int v = bytes[j] & 0xFF;
            hexChars[j * 2] = HEX_ARRAY[v >>> 4];
            hexChars[j * 2 + 1] = HEX_ARRAY[v & 0x0F];
        }
        return new String(hexChars);
    }
    
    public static String md5(byte[] bytes) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] thedigest = md.digest(bytes);
            return bytesToHex(thedigest);
        }
        catch (NoSuchAlgorithmException e){
            // Rethrow instead of silently discarding the exception.
            throw new RuntimeException("MD5 Failed", e);
        }
    }

     public static void main(String []args){
         String string = "test";
         final byte[] bytes = string.getBytes();
         try {
             final ByteArrayOutputStream bos = new ByteArrayOutputStream();
             final GZIPOutputStream gout = new GZIPOutputStream(bos);
             gout.write(bytes);
             gout.close();
             final byte[] encoded = bos.toByteArray();
             System.out.println("gzip: " + bytesToHex(encoded));
             System.out.println("md5: " + md5(encoded));
         }
         catch(IOException e)  {
             // Rethrow instead of silently discarding the exception.
             throw new RuntimeException("Failed", e);
         }
     }
}

Prints:

gzip: 1F8B08000000000000002B492D2E01000C7E7FD804000000
md5: 1ED3B12D0249E2565B01B146026C389D

So the two gzip outputs are very similar, but six bytes in the header differ:

1F8B0800 678B186002FF 2B492D2E01000C7E7FD804000000

1F8B0800 000000000000 2B492D2E01000C7E7FD804000000

Python's gzip.compress() accepts a compresslevel argument in the range 0-9. I tried all of them, but none gives the desired result. Is there any way to get the same output as Java's GZIPOutputStream in Python?

Your requirement, that the hash of the gzipped string in Python be identical to Java's, cannot be met in general. You need to change the requirement and meet the underlying need differently. I would recommend simply requiring that the decompressed data have identical hashes. In fact, each of the two gzip strings already contains a 32-bit hash (a CRC-32) of the decompressed data, and the two values are identical (0xd87f7e0c). If you want a longer hash, you can append one. The last four bytes are the uncompressed length, modulo 2^32, so you can compare those as well. Just compare the last eight bytes of the two strings and check that they are the same.
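The trailer comparison described above can be sketched in Python as follows. The two hex strings are the outputs quoted in the question; gzip_trailer is a hypothetical helper name, and the sketch assumes each input is a single-member gzip stream.

```python
import struct

# The two gzip outputs quoted in the question.
PYTHON_GZIP = bytes.fromhex("1F8B0800678B186002FF2B492D2E01000C7E7FD804000000")
JAVA_GZIP = bytes.fromhex("1F8B08000000000000002B492D2E01000C7E7FD804000000")


def gzip_trailer(blob):
    """Return (crc32, isize) read from the last eight bytes of a gzip member.

    Both fields are little-endian unsigned 32-bit integers: the CRC-32 of
    the uncompressed data, then the uncompressed length modulo 2**32.
    """
    crc32, isize = struct.unpack("<II", blob[-8:])
    return crc32, isize


py_crc, py_size = gzip_trailer(PYTHON_GZIP)
java_crc, java_size = gzip_trailer(JAVA_GZIP)

print(f"Python: crc32=0x{py_crc:08x} size={py_size}")
print(f"Java:   crc32=0x{java_crc:08x} size={java_size}")
print("trailers match:", (py_crc, py_size) == (java_crc, java_size))
```

A matching trailer only says that the CRC-32 and length agree, so it is a weak check compared with a cryptographic hash of the decompressed data.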

The difference between the two gzip strings in your question illustrates the issue. One has a time stamp in the header and the other does not (the field is set to zeros). Even if both had time stamps, they would very likely still differ. Some other header bytes differ as well, such as the one identifying the originating operating system.
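To see exactly which header fields differ, the fixed 10-byte gzip header can be decoded field by field. This is a minimal sketch based on the header layout in RFC 1952; parse_gzip_header is a hypothetical helper name, and the hex strings are again the ones from the question.

```python
import struct

# The two gzip outputs quoted in the question.
PYTHON_GZIP = bytes.fromhex("1F8B0800678B186002FF2B492D2E01000C7E7FD804000000")
JAVA_GZIP = bytes.fromhex("1F8B08000000000000002B492D2E01000C7E7FD804000000")


def parse_gzip_header(blob):
    """Decode the fixed 10-byte gzip header into a dict of named fields."""
    magic, method, flags, mtime, xfl, os_code = struct.unpack(
        "<2sBBIBB", blob[:10]
    )
    return {
        "magic": magic.hex(),  # always 1f8b
        "method": method,      # 8 = deflate
        "flags": flags,        # optional-field flags (FLG)
        "mtime": mtime,        # Unix time stamp, 0 = not set
        "xfl": xfl,            # extra flags, e.g. 2 = maximum compression
        "os": os_code,         # originating OS, 0 = FAT, 255 = unknown
    }


py_header = parse_gzip_header(PYTHON_GZIP)
java_header = parse_gzip_header(JAVA_GZIP)

# Report only the fields whose values differ between the two outputs.
for name in py_header:
    if py_header[name] != java_header[name]:
        print(f"{name}: python={py_header[name]} java={java_header[name]}")
```

For the two strings in the question, the differing fields are the time stamp, the extra flags, and the operating system byte.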

Furthermore, the compressed data in your examples is extremely short, so it just happens to be identical in this case. For any reasonable amount of data, however, the compressed data generated by two gzippers will be different unless they happen to be made with exactly the same deflate code, the same version of that code, and the same memory size and compression level settings. If you are not in control of all of those, you will never be able to guarantee the same compressed data coming out of them, given identical uncompressed data.
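The effect of settings can be observed even within a single implementation. The sketch below compresses the same made-up payload at two compression levels, with the time stamp fixed to zero so that it is not the source of any difference. It assumes Python 3.8 or later, where gzip.compress() accepts mtime. The compressed bytes will typically differ, while the decompressed data is the same.

```python
import gzip

# An arbitrary payload, large enough for compression settings to matter.
payload = b"".join(b"line %d: %d\n" % (i, i * i) for i in range(2000))

# Same input, same library, only the compression level changes.
fast = gzip.compress(payload, compresslevel=1, mtime=0)
best = gzip.compress(payload, compresslevel=9, mtime=0)

print("compressed sizes:", len(fast), len(best))
print("compressed bytes identical:", fast == best)
print("decompressed identical:", gzip.decompress(fast) == gzip.decompress(best))
```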

In short, don't waste your time trying to get identical compressed strings.
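The recommended alternative, hashing the decompressed data rather than the gzip bytes, might look like this in Python. The helper name content_md5 is hypothetical, and MD5 is used only because the question uses it.

```python
import gzip
import hashlib

# The two gzip outputs quoted in the question.
PYTHON_GZIP = bytes.fromhex("1F8B0800678B186002FF2B492D2E01000C7E7FD804000000")
JAVA_GZIP = bytes.fromhex("1F8B08000000000000002B492D2E01000C7E7FD804000000")


def content_md5(gzip_blob):
    """Return the upper-case MD5 hex digest of the decompressed payload."""
    return hashlib.md5(gzip.decompress(gzip_blob)).hexdigest().upper()


print("python:", content_md5(PYTHON_GZIP))
print("java:  ", content_md5(JAVA_GZIP))
print("same content:", content_md5(PYTHON_GZIP) == content_md5(JAVA_GZIP))
```

Because the hash is taken after decompression, header fields and compressor settings no longer affect the result.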
