Python gzip and Java GZIPOutputStream give different results

Question

I'm trying to take the hash of a gzipped string in Python and need it to be identical to Java's. But Python's gzip implementation seems to differ from Java's GZIPOutputStream.

Python gzip:

import gzip
import hashlib

gzip_bytes = gzip.compress(bytes('test', 'utf-8'))
gzip_hex = gzip_bytes.hex().upper()
md5 = hashlib.md5(gzip_bytes).hexdigest().upper()

>>> gzip_hex
'1F8B0800678B186002FF2B492D2E01000C7E7FD804000000'
>>> md5
'C4C763E9A0143D36F52306CF4CCC84B8'

Java GZIPOutputStream:

import java.io.ByteArrayOutputStream;
import java.util.zip.GZIPOutputStream;
import java.io.IOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class HelloWorld{
    private static final char[] HEX_ARRAY = "0123456789ABCDEF".toCharArray();
    public static String bytesToHex(byte[] bytes) {
        char[] hexChars = new char[bytes.length * 2];
        for (int j = 0; j < bytes.length; j++) {
            int v = bytes[j] & 0xFF;
            hexChars[j * 2] = HEX_ARRAY[v >>> 4];
            hexChars[j * 2 + 1] = HEX_ARRAY[v & 0x0F];
        }
        return new String(hexChars);
    }
    
    public static String md5(byte[] bytes) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] thedigest = md.digest(bytes);
            return bytesToHex(thedigest);
        }
        catch (NoSuchAlgorithmException e){
            // Rethrow so the failure is not silently swallowed
            throw new RuntimeException("MD5 Failed", e);
        }
    }

     public static void main(String []args){
         String string = "test";
         final byte[] bytes = string.getBytes();
         try {
             final ByteArrayOutputStream bos = new ByteArrayOutputStream();
             final GZIPOutputStream gout = new GZIPOutputStream(bos);
             gout.write(bytes);
             gout.close();
             final byte[] encoded = bos.toByteArray();
             System.out.println("gzip: " + bytesToHex(encoded));
             System.out.println("md5: " + md5(encoded));
         }
         catch(IOException e)  {
             // Rethrow so the failure is not silently swallowed
             throw new RuntimeException("Failed", e);
         }
     }
}

Prints:

gzip: 1F8B08000000000000002B492D2E01000C7E7FD804000000
md5: 1ED3B12D0249E2565B01B146026C389D

So the two gzip byte outputs are very similar, but slightly different:

1F8B0800 678B186002FF 2B492D2E01000C7E7FD804000000

1F8B0800 000000000000 2B492D2E01000C7E7FD804000000

Python's gzip.compress() method accepts a compresslevel argument in the range 0-9. I tried all of them, but none gives the desired result. Is there any way to get the same result as Java's GZIPOutputStream in Python?

Answer 1

Your requirement, "hash of gzipped string in Python and need it to be identical to Java's", cannot be met in general. You need to change your requirement and implement your need differently. I would recommend requiring simply that the decompressed data have identical hashes. In fact, a 32-bit hash (a CRC-32) of the decompressed data is already present in both gzip strings, and it is identical in the two (0xd87f7e0c). If you want a longer hash, then you can append one. The last four bytes are the uncompressed length, modulo 2³², so you can compare those as well. Just compare the last eight bytes of the two strings and check that they are the same.
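As a rough illustration of that last suggestion, the sketch below reads the eight-byte trailer from the two hex dumps in the question. The helper name gzip_trailer is made up for this example, and it assumes each blob holds a single gzip member.

```python
import struct

# Hex dumps copied from the question.
python_gzip = bytes.fromhex("1F8B0800678B186002FF2B492D2E01000C7E7FD804000000")
java_gzip = bytes.fromhex("1F8B08000000000000002B492D2E01000C7E7FD804000000")


def gzip_trailer(blob):
    """Return (crc32, isize) from the last eight bytes of a gzip stream.

    Hypothetical helper: assumes the blob contains exactly one gzip member.
    Both fields are stored as little-endian unsigned 32-bit integers.
    """
    return struct.unpack("<II", blob[-8:])


py_crc, py_size = gzip_trailer(python_gzip)
java_crc, java_size = gzip_trailer(java_gzip)

print(hex(py_crc), py_size)                  # 0xd87f7e0c 4
print((py_crc, py_size) == (java_crc, java_size))  # True
```

This only checks that the two streams claim the same CRC-32 and length; it does not decompress anything.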

The difference between the two gzip strings in your question illustrates the issue. One has a time stamp in the header, and the other does not (it is set to zeros). Even if they both had time stamps, they would still very likely be different. Some other header bytes also differ, such as the one identifying the originating operating system.
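To see those header differences field by field, here is a small sketch that decodes the fixed 10-byte gzip header of both dumps from the question. The helper gzip_header and its dictionary keys are names chosen for this example; the field layout follows the gzip format (magic bytes, method, flags, 4-byte little-endian modification time, extra flags, OS code).

```python
import struct

# Hex dumps copied from the question.
python_gzip = bytes.fromhex("1F8B0800678B186002FF2B492D2E01000C7E7FD804000000")
java_gzip = bytes.fromhex("1F8B08000000000000002B492D2E01000C7E7FD804000000")


def gzip_header(blob):
    """Decode the fixed 10-byte gzip header into a dict (hypothetical helper)."""
    id1, id2, method, flags, mtime, xfl, os_code = struct.unpack(
        "<BBBBIBB", blob[:10]
    )
    return {
        "magic": (id1, id2),
        "method": method,
        "flags": flags,
        "mtime": mtime,   # seconds since the Unix epoch, 0 means "not set"
        "xfl": xfl,       # extra flags, a hint about the compression level
        "os": os_code,    # originating operating system code
    }


py_header = gzip_header(python_gzip)
java_header = gzip_header(java_gzip)

# Report only the fields whose values differ between the two outputs.
for name in py_header:
    if py_header[name] != java_header[name]:
        print(name, py_header[name], java_header[name])
```

For these two dumps the differing fields are mtime, xfl and os, while everything after the tenth byte is the same.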

Furthermore, the compressed data in your examples is extremely short, so it just happens to be identical in this case. For any reasonable amount of data, however, the compressed data generated by two gzippers will be different unless they happen to be made with exactly the same deflate code, the same version of that code, and the same memory size and compression level settings. If you are not in control of all of those, you will never be able to guarantee the same compressed data coming out of them, given identical uncompressed data.
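The effect of settings alone can be explored with a sketch like the one below, which compresses the same input with Python's zlib module at two compression levels. The sample data and the helper raw_deflate are invented for this example. Whether the two raw streams differ depends on the input and the zlib build, so the code prints the comparison rather than assuming it; both streams still decompress to the same bytes.

```python
import zlib


def raw_deflate(data, level):
    """Compress to a raw deflate stream (hypothetical helper).

    wbits=-15 omits the zlib/gzip header and trailer, so only the
    compressed data itself is compared.
    """
    comp = zlib.compressobj(level, zlib.DEFLATED, -15)
    return comp.compress(data) + comp.flush()


# Made-up sample input, a few tens of kilobytes of semi-repetitive text.
data = b"".join(
    b"line %d: the quick brown fox jumps over the lazy dog\n" % i
    for i in range(2000)
)

fast = raw_deflate(data, 1)
best = raw_deflate(data, 9)

print("level 1:", len(fast), "bytes")
print("level 9:", len(best), "bytes")
print("identical compressed bytes:", fast == best)

# Regardless of level, both streams decode back to the same input.
print(zlib.decompress(fast, -15) == zlib.decompress(best, -15) == data)
```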

In short, don't waste your time trying to get identical compressed strings.
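A minimal sketch of the alternative recommended above, hashing the decompressed data rather than the gzip bytes, using the two dumps from the question. The helper name content_md5 is an assumption for this example; MD5 is used only because the question uses it.

```python
import gzip
import hashlib

# Hex dumps copied from the question.
python_gzip = bytes.fromhex("1F8B0800678B186002FF2B492D2E01000C7E7FD804000000")
java_gzip = bytes.fromhex("1F8B08000000000000002B492D2E01000C7E7FD804000000")


def content_md5(blob):
    """MD5 of the decompressed payload (hypothetical helper).

    Header fields and compressor choices do not affect the result,
    because only the original data is hashed.
    """
    return hashlib.md5(gzip.decompress(blob)).hexdigest().upper()


print(gzip.decompress(python_gzip))                           # b'test'
print(content_md5(python_gzip) == content_md5(java_gzip))     # True
```

The same idea carries over to the Java side by hashing the bytes before they are written to the GZIPOutputStream.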

Answer 1
1 2021-02-02 00:13:19
