Why does my java.util.zip function show inconsistent behavior?

Question

I have a Java application that uses the java.util.zip library to compress and decompress files. There is a zip file on the server (created by my application); the client zips some of its files and uploads the archive to the server, but if the underlying files have not changed I don't want to waste time uploading. I figured I could calculate the MD5 hashes of the client-side and server-side archives and see whether they match. What actually happens is this: I use my application to decompress a zip file, and then, without changing any of the underlying files, I use my application to re-compress it, and the old and new zip files have different MD5 hashes. Does anybody know why this is happening, and whether there's a better way to compare two zip files? Thanks.

Answer 1

It's even worse, I think:

Doing the same zip operation twice can result in two different zip archives:

> zip some.zip some.txt 
  adding: some.txt (stored 0%)
> zip other.zip some.txt
  adding: some.txt (stored 0%)
> ll
total 24
-rw-r--r--  1 cthies  staff  170 12 Dez 18:01 other.zip
-rw-r--r--  1 cthies  staff    4 12 Dez 18:01 some.txt
-rw-r--r--  1 cthies  staff  170 12 Dez 18:01 some.zip
> md5 *.zip
MD5 (other.zip) = f56d7753c5af78427274d930b9fb8c90
MD5 (some.zip) = e2f0382c4ad31871f62fb559157df8e8

Looking at the binaries, one can see a difference in just one place:

> xxd some.zip > some.xxd
> xxd other.zip > other.xxd
> colordiff *.xxd
3c3
< 0000020: 6d65 2e74 7874 5554 0900 0363 33e6 4e78  me.txtUT...c3.Nx
---
> 0000020: 6d65 2e74 7874 5554 0900 0363 33e6 4e64  me.txtUT...c3.Nd

I think (depending on the zip application itself) the current system time can be involved. Thus any zip operation, even on exactly the same sources, can(!) be unique, and therefore the checksums can't be assumed to be equal.

Time-independent tools I found: tar, 7z (both command-line). I.e., tar and 7z reproduce archives with equal checksums (MD5).

(Tested on OS X 10.6.8 with the command-line zip utility.)
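To tie the timestamp point back to java.util.zip: below is a minimal sketch of writing an archive whose bytes do not depend on when it was created. It assumes the files are already in memory as a name-to-bytes map; the class name `DeterministicZip`, the method `zip`, and the fixed timestamp are all made up for illustration. It pins every entry's modification time with `ZipEntry.setTime` and writes entries in sorted name order. As far as I know the zip date field is stored in local time, so this only lines up across machines that share a time zone and the same Deflate implementation and compression level.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

final class DeterministicZip {
    // Arbitrary fixed timestamp (2000-01-01T00:00:00Z); an assumption of this sketch.
    // It is inside the range the classic zip date field can represent.
    private static final long FIXED_TIME_MILLIS = 946684800000L;

    // Builds a zip archive in memory from entry names mapped to their contents.
    static byte[] zip(Map<String, byte[]> files) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(buffer)) {
            // TreeMap iterates in sorted key order, so entry order is stable.
            for (Map.Entry<String, byte[]> file : new TreeMap<>(files).entrySet()) {
                ZipEntry entry = new ZipEntry(file.getKey());
                // Without this, the entry would carry the current system time.
                entry.setTime(FIXED_TIME_MILLIS);
                out.putNextEntry(entry);
                out.write(file.getValue());
                out.closeEntry();
            }
        }
        // The stream is closed here, so the central directory has been written.
        return buffer.toByteArray();
    }
}
```

Calling `DeterministicZip.zip(files)` twice with the same map in the same JVM should give byte-identical arrays, which is what an MD5 comparison of whole archives needs.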

Answer 2

Just a wild shot in the dark: are the two file systems you are calculating your hash values on cased differently?

That is, is one of them Windows, which treats ABC.CLASS and abc.class as identical file names, and the other a Unix variant, which treats ABC.CLASS and abc.class as different?

Just a wild guess...

EDIT: You might also look at the embedded directory separator characters (/, \ or :) inside the zip file.

Answer 3

1) Check the time stamps on the files. The files made by unzipping might have a different last-modified date and/or creation date. That file metadata might be used to create the hash.

2) Are you using the same OS on both systems? If the OSes are different, they might be using different character encodings.

3) Can you diff the zip files? Different MD5 hashes should mean different data. It will be messy, but you might get some clues by comparing the raw files.

Answer 4

You cannot compare the zip files produced by differing zip programs and expect them to be exactly the same, even if the exact same files were used as input.

Zipping a file is not guaranteed to be deterministic across two different implementations of the zip encoding. Zip works by replacing repeated sections of data with what amounts to a lookup key. Two different algorithms can determine the dictionary (the set of repeated data) differently in an effort to optimize the compression level. Yet both implementations can create valid zip files that, when unzipped, result in the same file.

The only reliable way to do this would be to guarantee that the exact same zip algorithm is being used in both cases.

EDIT: This is why you see different compression level settings in the Java implementation of the Deflate algorithm: http://download.oracle.com/javase/1.5.0/docs/api/java/util/zip/Deflater.html
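Since two valid archives can differ byte for byte and still unzip to the same files, one alternative is to compare what is inside the archives instead of hashing the archives themselves. The following is a rough sketch, not a definitive implementation: the class `ZipContentCompare` and its method names are hypothetical, and it relies on the CRC-32 of the uncompressed data that each entry records, read through `java.util.zip.ZipFile`. CRC-32 is not a cryptographic hash, so treat a match as "very likely unchanged" and not as proof.

```java
import java.io.IOException;
import java.util.Enumeration;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

final class ZipContentCompare {
    // Maps each file entry's name to the CRC-32 of its uncompressed data.
    // Timestamps, compression level and entry order are deliberately ignored.
    static Map<String, Long> fingerprint(String zipPath) throws IOException {
        Map<String, Long> crcs = new TreeMap<>();
        try (ZipFile zip = new ZipFile(zipPath)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (!entry.isDirectory()) {
                    crcs.put(entry.getName(), entry.getCrc());
                }
            }
        }
        return crcs;
    }

    // True when both archives hold the same entry names with matching CRC-32 values.
    static boolean sameContents(String firstZip, String secondZip) throws IOException {
        return fingerprint(firstZip).equals(fingerprint(secondZip));
    }
}
```

With this, the client could send the (small) name-to-CRC map, or a hash of it, and upload only when it differs from the server's. Entry names are compared exactly, so differences in case or in directory separators would still show up as a mismatch.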

Answer 5

You are writing a new file, not the same file, so from what I understand from threads like this one, the MD5 will change: "MD5 hash is not reversible"

Answer 1: score 3, posted 2011-12-12 17:13:16
Answer 2: score 1, posted 2011-01-28 19:34:12
Answer 3: score 1, posted 2011-01-28 19:44:11
Answer 4: score 0, posted 2011-01-29 02:07:09
Answer 5: score -2, posted 2011-01-28 19:48:41
