Trouble using python's gzip/“How do I know what compression is being used?”

Question

OK, so I've got an open-source Java client/server program that uses packets to communicate. I'm trying to write a Python client for it, but the contents of the packets seem to be compressed. A quick perusal of the source code suggested gzip as the compression scheme (it was the only compression module imported anywhere in the code that I could find), but when I saved the data from one of the packets out of Wireshark and tried this:

import gzip
f = gzip.open('compressed_file')
f.read()

Python told me that this wasn't a gzip file because the header was wrong. Can someone advise me what I've done wrong here? Did I change or mess up the format when I saved it? Do I need to strip away some of the extraneous data from the packet before I run this code on it?

Here is the Java code that I think is doing the compression:

    if (zipped) {

        // XML encode the data and GZIP it.
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        Writer zipOut = new BufferedWriter(new OutputStreamWriter(
                new GZIPOutputStream(baos)));
        PacketEncoder.encodeData(packet, zipOut);
        zipOut.close();

        // Base64 encode the compressed data.
        // Please note, I couldn't get anything other than a
        // straight stream-to-stream encoding to work.
        byte[] zipData = baos.toByteArray();
        ByteArrayOutputStream base64 = new ByteArrayOutputStream(
                (4 * zipData.length + 2) / 3);
        Base64.encode(new ByteArrayInputStream(zipData), base64, false);

EDIT: OK, sorry; here is the information requested. It was gathered using Wireshark to listen in on communication between two running copies of the original program on different computers. To get the hex stream below, I used Wireshark's "Copy -> Hex (Byte Stream)" option.

001321cdc68ff4ce46e4f00d0800450000832a85400080061e51ac102cceac102cb004f8092a9909b32c10e81cb25018f734823e00000100000000000000521f8b08000000000000005bf39681b59c85818121a0b4884138da272bb12c512f27312f5dcf3f292b35b9c47ac2b988f902c59a394c0c0c150540758c250c5c2ea5b9b9950a2e89258900aa4c201a3f000000

I know this will contain the string "Dummy Data" in it. I believe it should also contain "Jonathanb" (the player name I used to send the message) and the integer 80 (80 is the command # for "Chat" as far as I can gather from the code).

Answer 1

You could try using the standard library module zlib directly; that's what gzip uses for the compress/decompress part. If the decompress function doesn't like the whole packet, you can try different values of wbits and/or slicing a few bytes off the packet's front. (If you could "reverse engineer" exactly how the Java code is compressing that packet, even just working out which wbits value it is using or whether it's putting out any prefix before the compressed data, that would help immensely, of course.)
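As a rough illustration of that trial-and-error approach, here is a minimal Python 3 sketch that tries each byte offset of the hex dump from the question's edit against three wbits settings. The names PACKET_HEX, WBITS_CANDIDATES and probe are made up for this example, and the approach is brute force rather than anything the Java program documents.

```python
import binascii
import zlib

# Hex dump copied from the question's edit.
PACKET_HEX = "001321cdc68ff4ce46e4f00d0800450000832a85400080061e51ac102cceac102cb004f8092a9909b32c10e81cb25018f734823e00000100000000000000521f8b08000000000000005bf39681b59c85818121a0b4884138da272bb12c512f27312f5dcf3f292b35b9c47ac2b988f902c59a394c0c0c150540758c250c5c2ea5b9b9950a2e89258900aa4c201a3f000000"
packet = binascii.unhexlify(PACKET_HEX)

# wbits selects the container format that zlib expects around the data.
WBITS_CANDIDATES = [
    ("gzip (RFC 1952)", 16 + zlib.MAX_WBITS),
    ("zlib (RFC 1950)", zlib.MAX_WBITS),
    ("raw deflate (RFC 1951)", -zlib.MAX_WBITS),
]


def probe(data):
    """Return (offset, format, output) for each slice that inflates completely."""
    hits = []
    for offset in range(len(data)):
        for name, wbits in WBITS_CANDIDATES:
            decompressor = zlib.decompressobj(wbits)
            try:
                out = decompressor.decompress(data[offset:])
            except zlib.error:
                continue  # this offset/format combination is not valid
            # Only count streams that reached their end marker and produced data.
            if decompressor.eof and out:
                hits.append((offset, name, out))
    return hits


for offset, name, out in probe(packet):
    print(offset, name, out)
```

Raw deflate has no header and no checksum, so a hit in that format can be accidental; a gzip or zlib hit is much stronger evidence because the header and the trailing checksum both have to agree.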

The only likely "damage" you might have done to the file itself would be, on Windows, writing it without specifying 'wb' for binary mode; writing it in "text mode" on Windows would make the file unusable. Just saying...!-)
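For what it's worth, a small Python 3 sketch of saving a Wireshark hex stream safely. The helper save_hex_stream, the sample data and the file name packet.bin are all invented for the illustration; the only point is the 'wb' and 'rb' modes.

```python
import binascii
import gzip
import os
import tempfile


def save_hex_stream(hex_stream, path):
    """Convert 'Hex (Byte Stream)' text to raw bytes and save them unchanged."""
    raw = binascii.unhexlify(hex_stream.strip())
    with open(path, "wb") as f:  # "wb": no newline translation on Windows
        f.write(raw)
    return raw


# Hypothetical sample: a gzip stream whose payload contains newline bytes,
# the kind of data that text mode on Windows would corrupt.
sample = gzip.compress(b"line one\nline two\n")
hex_stream = binascii.hexlify(sample).decode("ascii")

path = os.path.join(tempfile.mkdtemp(), "packet.bin")
save_hex_stream(hex_stream, path)

with open(path, "rb") as f:  # read back in binary mode as well
    read_back = f.read()

print(read_back == sample)
print(gzip.decompress(read_back))
```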

Answer 2

It would help enormously if you divulged:

(0) What leads you to the conclusion that "the contents of the packet seem to be compressed"

(1) The URLs for the (a) source and (b) documentation of the package that is writing the packets

(2) The contents of a sample packet

(a) print repr(open('file_saved_from_wireshark', 'rb').read())

(b) just in case the long trip around via wireshark is muddying the water, insert this in your Python client:

print repr(a_sample_packet)

(3) the exact error message that you got (copy/paste)

Update after OP supplied the hex dump of a packet

This code:

import binascii, sys, cStringIO, gzip, struct, zlib
# guff is allegedly a "packet", formatted as 2 hex characters per byte
guff = "001321cdc68ff4ce46e4f00d0800450000832a85400080061e51ac102cceac102cb004f8092a9909b32c10e81cb25018f734823e00000100000000000000521f8b08000000000000005bf39681b59c85818121a0b4884138da272bb12c512f27312f5dcf3f292b35b9c47ac2b988f902c59a394c0c0c150540758c250c5c2ea5b9b9950a2e89258900aa4c201a3f000000"
guff2 = binascii.unhexlify(guff)
print "raw input: len=%d repr=%r" % (len(guff2), guff2)
# gzip spec: http://www.faqs.org/rfcs/rfc1952.html
GZIP_HDR = "\x1F\x8B\x08"
gzpos = guff2.find(GZIP_HDR)
if gzpos == -1:
    print "Can't find gzip header"
    sys.exit(1)
print gzpos, "bytes before gzipped data"
gzipped = guff2[gzpos:]
packet_crc, packet_orig_len = struct.unpack("<II", gzipped[-8:])
print "packet_crc, packet_orig_len:", hex(packet_crc), packet_orig_len
fobj = cStringIO.StringIO(gzipped)
zf = gzip.GzipFile(fileobj=fobj)
payload = zf.read()
print "payload: len=%d repr=%r" % (len(payload), payload)
print "crc32(payload):", hex(zlib.crc32(payload))

produced this output (wrapped at col 80 by Windows' "Command Prompt" terminal) when run with Python 2.6.4:

raw input: len=145 repr="\x00\x13!\xcd\xc6\x8f\xf4\xceF\xe4\xf0\r\x08\x00E\x00\x
00\x83*\x85@\x00\x80\x06\x1eQ\xac\x10,\xce\xac\x10,\xb0\x04\xf8\t*\x99\t\xb3,\x1
0\xe8\x1c\xb2P\x18\xf74\x82>\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00R\x1f\x8b\x0
8\x00\x00\x00\x00\x00\x00\x00[\xf3\x96\x81\xb5\x9c\x85\x81\x81!\xa0\xb4\x88A8\xd
a'+\xb1,Q/'1/]\xcf?)+5\xb9\xc4z\xc2\xb9\x88\xf9\x02\xc5\x9a9L\x0c\x0c\x15\x05@u\
x8c%\x0c\\.\xa5\xb9\xb9\x95\n.\x89%\x89\x00\xaaL \x1a?\x00\x00\x00"
63 bytes before gzipped data
packet_crc, packet_orig_len: 0x1a204caa 63
payload: len=63 repr='\xac\xed\x00\x05w\x04\x00\x00\x00Pur\x00\x13[Ljava.lang.Ob
ject;\x90\xceX\x9f\x10s)l\x02\x00\x00xp\x00\x00\x00\x01t\x00\nDummy Data'
crc32(payload): 0x1a204caa

Comments/questions:

This packet is 145 bytes long; what happened to the idea that a packet was about 2900 bytes?
The packet is 63 bytes of as-yet-unanalysed data followed by an 82-byte gzip stream which decompresses(!) to 63 bytes. There is no data after the gzip stream -- verified by comparing the last 8 bytes of the packet with calculated gzip values. It contains the expected "Dummy Data", but the userid "Jonathanb" is not there (or it is obfuscated or encrypted).
The packet structure doesn't match the code that we guessed was being used (no XML, no base64).
The gunzipped data contains the string "java.lang.Object", which is probably symptomatic of some Java serialisation protocol. Lasciate ogni speranza, voi ch'entrate ("Abandon all hope, ye who enter here").
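To make these observations a little more concrete, here is a Python 3 sketch that dissects the same hex dump in three steps. It assumes the capture is an Ethernet II frame carrying IPv4 and TCP (the bytes look consistent with that, but it is an assumption), and that the payload follows the standard Java object serialization stream format, whose token values (0x70 to 0x78) come from the Java serialization specification. It only handles the token sequence found in this one packet, and reading the 4-byte block as the command number is a guess.

```python
import binascii
import struct
import zlib

# Hex dump of the captured frame, copied from the question.
FRAME_HEX = "001321cdc68ff4ce46e4f00d0800450000832a85400080061e51ac102cceac102cb004f8092a9909b32c10e81cb25018f734823e00000100000000000000521f8b08000000000000005bf39681b59c85818121a0b4884138da272bb12c512f27312f5dcf3f292b35b9c47ac2b988f902c59a394c0c0c150540758c250c5c2ea5b9b9950a2e89258900aa4c201a3f000000"
frame = binascii.unhexlify(FRAME_HEX)

# 1. Skip the capture's own headers. ASSUMPTION: Ethernet II + IPv4 + TCP.
if frame[12:14] != b"\x08\x00":                     # EtherType 0x0800 = IPv4
    raise ValueError("not an IPv4 frame")
ip_start = 14                                       # Ethernet II header size
ip_header_len = (frame[ip_start] & 0x0F) * 4        # IHL, in 32-bit words
if frame[ip_start + 9] != 6:                        # IP protocol 6 = TCP
    raise ValueError("not a TCP segment")
tcp_start = ip_start + ip_header_len
tcp_header_len = (frame[tcp_start + 12] >> 4) * 4   # TCP data offset, in words
app_start = tcp_start + tcp_header_len
app_data = frame[app_start:]

# 2. Split the application data into its prefix and the gzip stream.
gz_pos = app_data.find(b"\x1f\x8b\x08")
if gz_pos == -1:
    raise ValueError("no gzip header found")
prefix, gz_stream = app_data[:gz_pos], app_data[gz_pos:]
(prefix_tail,) = struct.unpack(">I", prefix[-4:])   # GUESS: a length field
payload = zlib.decompress(gz_stream, 16 + zlib.MAX_WBITS)

# 3. Walk the Java serialization tokens that this particular payload uses.
TC_NULL, TC_CLASSDESC, TC_STRING = 0x70, 0x72, 0x74
TC_ARRAY, TC_BLOCKDATA, TC_ENDBLOCKDATA = 0x75, 0x77, 0x78


def expect(pos, token):
    """Check that payload[pos] is the given token and step past it."""
    if payload[pos] != token:
        raise ValueError("unexpected token 0x%02x at %d" % (payload[pos], pos))
    return pos + 1


magic, version = struct.unpack(">HH", payload[:4])  # stream magic and version
pos = expect(4, TC_BLOCKDATA)
block_len = payload[pos]
# GUESS: the block is one 4-byte big-endian int (only valid if block_len == 4).
(command,) = struct.unpack(">i", payload[pos + 1:pos + 1 + block_len])
pos = expect(pos + 1 + block_len, TC_ARRAY)
pos = expect(pos, TC_CLASSDESC)
(name_len,) = struct.unpack(">H", payload[pos:pos + 2])
class_name = payload[pos + 2:pos + 2 + name_len].decode("ascii")
pos += 2 + name_len + 8 + 1 + 2   # skip serialVersionUID, flags, field count
pos = expect(pos, TC_ENDBLOCKDATA)
pos = expect(pos, TC_NULL)        # no superclass descriptor
(array_len,) = struct.unpack(">i", payload[pos:pos + 4])
pos = expect(pos + 4, TC_STRING)
(text_len,) = struct.unpack(">H", payload[pos:pos + 2])
text = payload[pos + 2:pos + 2 + text_len].decode("utf-8")  # ASCII-safe here

print("headers:", app_start, "bytes; prefix:", binascii.hexlify(prefix))
print("prefix tail:", prefix_tail, "gzip stream length:", len(gz_stream))
print("magic: 0x%04x version: %d command: %d" % (magic, version, command))
print("array of", class_name, "with", array_len, "element:", repr(text))
```

If the assumptions hold, this suggests that most of the 63 leading bytes are ordinary link, IP and TCP headers added by the capture, that the short application prefix ends in a value equal to the gzip stream's length, and that the integer 80 the question expected sits in the first data block of the serialized stream. None of this is confirmed by the program's source.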

Answer 3

It's likely to be compliant with one of RFC 1950, 1951, or 1952.

Since the name is GZIP, I'd first check 1952. Then I'd try ZLIB, 1950. Finally, DEFLATE (1951).
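The first two formats can also be told apart by their leading bytes, without decompressing anything. The following Python 3 sketch checks the gzip magic number from RFC 1952 and the CMF/FLG header rule from RFC 1950; sniff_format is a name invented for this example. Raw DEFLATE has no header at all, so it can only be identified by elimination.

```python
import gzip
import zlib


def sniff_format(data):
    """Guess the container format of a compressed stream from its first bytes."""
    # RFC 1952: ID1=0x1f, ID2=0x8b, CM=8 (deflate).
    if len(data) >= 3 and data[:2] == b"\x1f\x8b" and data[2] == 8:
        return "gzip (RFC 1952)"
    # RFC 1950: CM=8, CINFO<=7, and the 16-bit header is a multiple of 31.
    if len(data) >= 2:
        cmf, flg = data[0], data[1]
        if cmf & 0x0F == 8 and cmf >> 4 <= 7 and (cmf * 256 + flg) % 31 == 0:
            return "zlib (RFC 1950)"
    return "unknown (raw DEFLATE, RFC 1951, has no header to check)"


print(sniff_format(gzip.compress(b"Dummy Data")))
print(sniff_format(zlib.compress(b"Dummy Data")))
```

A match here is only a hint; the two-byte zlib check in particular can pass by chance on arbitrary data, so it is worth confirming by actually decompressing the stream.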

DotNetZip is a .NET library that allows a .NET app to read data streams that comply with any of these formats. If you had a stream that complied with one of the above, you could very quickly determine which one it was by trying to read the stream with each of DotNetZip's streams in succession: GZipStream, ZlibStream, DeflateStream. One of them will work, and the others will not.

I don't know of a Java library that has those streams. Doesn't mean it doesn't exist. Just that I don't know of one.

DotNetZip is free and works on Windows+Mono, Linux+Mono, as well as Windows+.NET.

3 answers

solution1
1 2010-01-23 05:12:20

solution2
1 ACCEPTED 2010-01-23 13:19:11

solution3
0 2010-01-23 05:34:41
