
Why are my java.util.zip functions showing inconsistent behavior?

I have a Java application that uses the java.util.zip library to compress and decompress files. There is a zip file on the server (created by my application), and the client zips some of their files and uploads the result to the server. If the underlying files have not changed, I don't want to waste time uploading.

I figured I could calculate the MD5 hashes of the client-side and server-side zip files and compare them. However, when I use my application to decompress a zip file and then, without changing any of the underlying files, re-compress it, the old and new zip files have different MD5 hashes.

Does anybody know why this is happening, and is there a better way to compare two zip files? Thanks.

It's even worse, I think:

Doing the same zip-operation twice can result in two different zip-archives:

> zip some.zip some.txt 
  adding: some.txt (stored 0%)
> zip other.zip some.txt
  adding: some.txt (stored 0%)
> ll
total 24
-rw-r--r--  1 cthies  staff  170 12 Dez 18:01 other.zip
-rw-r--r--  1 cthies  staff    4 12 Dez 18:01 some.txt
-rw-r--r--  1 cthies  staff  170 12 Dez 18:01 some.zip
> md5 *.zip
MD5 (other.zip) = f56d7753c5af78427274d930b9fb8c90
MD5 (some.zip) = e2f0382c4ad31871f62fb559157df8e8

Looking at the binaries, the two archives differ in just one place:

> xxd some.zip > some.xxd
> xxd other.zip > other.xxd
> colordiff *.xxd
3c3
< 0000020: 6d65 2e74 7874 5554 0900 0363 33e6 4e78  me.txtUT...c3.Nx
---
> 0000020: 6d65 2e74 7874 5554 0900 0363 33e6 4e64  me.txtUT...c3.Nd

I think (depending on the zip application itself) the current system time can be involved: the differing byte sits right after the "UT" marker, which is where zip stores extended timestamps for the entry. Thus any zip operation, even on exactly the same sources, can(!) produce a unique archive, and therefore the checksums can't be assumed to be equal.

Time-independent tools I found: tar and 7z (both command-line). In my tests, tar and 7z reproduce archives with equal MD5 checksums.

(tested on OSX 10.6.8 with command-line zip utility)
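For the Java side of the question, the same timestamp effect can be neutralized when writing the archive. Below is a minimal sketch, not the asker's actual code: the class name DeterministicZip, the helper names, and the fixed timestamp are all assumptions made for illustration. It pins every entry to one timestamp and writes entries in sorted order, so zipping the same content twice in the same JVM gives the same bytes and the same MD5.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.math.BigInteger;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical helper, for illustration only.
final class DeterministicZip {
    // Arbitrary fixed timestamp (assumption): 2001-09-09T01:46:40Z.
    static final long FIXED_TIME_MILLIS = 1_000_000_000_000L;

    // Zips entry-name -> content pairs with a pinned timestamp and sorted order.
    static byte[] zip(Map<String, byte[]> files) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(buffer)) {
            // TreeMap gives a stable entry order regardless of the input map type.
            for (Map.Entry<String, byte[]> file : new TreeMap<>(files).entrySet()) {
                ZipEntry entry = new ZipEntry(file.getKey());
                // Without this, the entry gets the current system time.
                entry.setTime(FIXED_TIME_MILLIS);
                out.putNextEntry(entry);
                out.write(file.getValue());
                out.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Hex-encoded MD5 of a byte array, matching the check described in the question.
    static String md5Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(data);
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}
```

Calling DeterministicZip.md5Hex(DeterministicZip.zip(files)) twice on the same map should give the same hash. Two caveats: as far as I know the classic zip timestamp is stored in local time, so the client and server JVMs would also need the same default time zone, and both sides would need the same Deflater implementation and compression level.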

Just a wild shot in the dark -- are the two file systems you are calculating your hash values on differently cased?

That is, is one of them Windows, which treats ABC.CLASS and abc.class file names as identical, and the other a Unix variant, which treats ABC.CLASS and abc.class as different?

Just a wild guess...

EDIT: You might also look at the embedded directory separator characters (/, \, . or :) inside the zip file.

1) Check the timestamps on the files. The files made by unzipping might have a different last-modified date and/or creation date. That file metadata may end up in the new archive and therefore change its hash.

2) Are you using the same OS on both systems? If the OSes are different they might be using a different character encoding.

3) Can you diff the zip files? Different MD5 hashes should mean different data. It will be messy but you might get some clues by comparing the raw files.
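Since timestamps and other metadata can differ even when the files are identical, one option is to compare what is inside the archives rather than the archives themselves. The sketch below is only one way to do that, and the class and method names (ZipContentCompare, checksums, sameContent) are made up for this example. It reads each entry, computes a CRC-32 over the decompressed bytes, and compares the resulting name-to-checksum maps.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.CRC32;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, for illustration only.
final class ZipContentCompare {
    // Maps each entry name to the CRC-32 of its decompressed content.
    static Map<String, Long> checksums(InputStream zipData) {
        Map<String, Long> result = new TreeMap<>();
        try (ZipInputStream in = new ZipInputStream(zipData)) {
            byte[] chunk = new byte[8192];
            ZipEntry entry;
            while ((entry = in.getNextEntry()) != null) {
                CRC32 crc = new CRC32();
                int read;
                // read() returns -1 at the end of the current entry.
                while ((read = in.read(chunk)) != -1) {
                    crc.update(chunk, 0, read);
                }
                result.put(entry.getName(), crc.getValue());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }

    // True when both archives hold the same entry names with the same checksums.
    // Timestamps, entry order, and compression level are ignored.
    static boolean sameContent(InputStream first, InputStream second) {
        return checksums(first).equals(checksums(second));
    }
}
```

Entry names are compared exactly, so differences in case or in directory separators still show up as a mismatch. CRC-32 is a weak checksum; if accidental collisions are a concern, swapping in MessageDigest with a stronger algorithm should work the same way.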

You cannot compare the resulting zip files from differing zip programs and expect them to be exactly the same, even if the exact same files were used before compression.

Zipping a file is not guaranteed to be deterministic between two different implementations of the zip encodings. Zip works by replacing repeated sections of data with what amounts to a lookup key. Two different algorithms can determine the dictionary (set of repeated data) differently, in an effort to optimize the compression level. Yet both implementations can create valid zip files that, when unzipped, result in the same file.

The only reliable way to do this would be to guarantee that the exact same zip algorithm is being used in both cases.

EDIT: This is why you see different compression level settings in the Java implementation of the Deflate algorithm http://download.oracle.com/javase/1.5.0/docs/api/java/util/zip/Deflater.html
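To illustrate the point about compression levels, here is a small sketch using java.util.zip.Deflater and Inflater directly. The class name DeflateLevels and its two methods are assumptions for the example, not part of any library. The same input compressed with Deflater.BEST_SPEED and Deflater.BEST_COMPRESSION will typically give different compressed bytes, yet both inflate back to the identical original.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper, for illustration only.
final class DeflateLevels {
    // Compresses the input at the given level (e.g. Deflater.BEST_SPEED).
    static byte[] deflate(byte[] input, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        while (!deflater.finished()) {
            int written = deflater.deflate(chunk);
            out.write(chunk, 0, written);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Decompresses data produced by deflate(), whatever level was used.
    static byte[] inflate(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        try {
            while (!inflater.finished()) {
                int read = inflater.inflate(chunk);
                // No progress and nothing left to feed means the stream is unusable.
                if (read == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or unsupported deflate stream");
                }
                out.write(chunk, 0, read);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("not valid deflate data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }
}
```

Comparing deflate(data, Deflater.BEST_SPEED) with deflate(data, Deflater.BEST_COMPRESSION) byte by byte is therefore not a reliable equality check, while comparing the inflated results is.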

You are writing a new file, not the same file, so from what I understand from threads like this one, the MD5 will change: MD5 hash is not reversible
