Java GZip introduces small differences when compressing a file and decompressing it again

After a week of work I designed a binary file format and wrote a Java reader for it. It's just an experiment, and it works fine unless I use the GZip compression option.

I called my binary type MBDF (Minimal Binary Database Format), and it can store 8 different types:

  • Integer (there are no separate byte, short, or long types, since integers are stored in flexible space: bigger numbers take more space)
  • Float-32 (32-bit floating-point format, like Java's float type)
  • Float-64 (64-bit floating-point format, like Java's double type)
  • String (a string in UTF-16 format)
  • Boolean
  • Null (just specifies a null value)
  • Array (something like Java's ArrayList<Object>)
  • Compound (a String-to-Object map)

I used this data as test data:

COMPOUND {
    float1: FLOAT_32 3.3
    bool2: BOOLEAN true
    float2: FLOAT_64 3.3
    int1: INTEGER 3
    compound1: COMPOUND {
        xml: STRING "two length compound"
        int: INTEGER 23
    }
    string1: STRING "Hello world!"
    string2: STRING "3"
    arr1: ARRAY [
        STRING "Hello world!"
        INTEGER 3
        STRING "3"
        FLOAT_32 3.29
        FLOAT_64 249.2992
        BOOLEAN true
        COMPOUND {
            str: STRING "one length compound"
        }
        BOOLEAN false
        NULL null
    ]
    bool1: BOOLEAN false
    null1: NULL null
}

The xml key in the nested compound matters for this question!

I made a file from it using this Java code:

MBDFFile.writeMBDFToFile(
    "/Users/<anonymous>/Documents/Java/MBDF/resources/file.mbdf", 
    b.makeMBDF(false)
);

Here, the variable b is an MBDFBinary object containing all the data given above. The makeMBDF function generates the ISO 8859-1 encoded string and, if the given boolean is true, compresses it using GZip. When the file is written, an extra information character is added at the beginning, describing how to read the file back.

Then, after writing the file, I read it back into Java and parse it:

MBDF mbdf = MBDFFile.readMBDFFromFile("/Users/<anonymous>/Documents/Java/MBDF/resources/file.mbdf");
System.out.println(mbdf.getBinaryObject().parse());

This prints exactly the information mentioned above.

Then I try to use compression:

MBDFFile.writeMBDFToFile(
    "/Users/<anonymous>/Documents/Java/MBDF/resources/file.mbdf", 
    b.makeMBDF(true)
);

I read it back exactly as I did with the uncompressed file, which should work. It prints this:

COMPOUND {
    float1: FLOAT_32 3.3
    bool2: BOOLEAN true
    float2: FLOAT_64 3.3
    int1: INTEGER 3
    compound1: COMPOUND {
        xUT: STRING 'two length compound'
        int: INTEGER 23
    }
    string1: STRING 'Hello world!'
    string2: STRING '3'
    arr1: ARRAY [
        STRING 'Hello world!'
        INTEGER 3
        STRING '3'
        FLOAT_32 3.29
        FLOAT_64 249.2992
        BOOLEAN true
        COMPOUND {
            str: STRING 'one length compound'
        }
        BOOLEAN false
        NULL null
    ]
    bool1: BOOLEAN false
    null1: NULL null
}

Compared with the original data, the name xml has changed into xUT for some reason.

After some research I found small differences in the binary data before and after compression. For example, bit patterns such as 110011 change into 101010.
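One way to pin down where two byte sequences diverge is to compare them byte by byte and print the differing bytes as bits. The following is a minimal sketch of such a check; the class name ByteDiff and its helper methods are hypothetical and not part of MBDF.

```java
class ByteDiff {

    // Returns the index of the first differing byte, or -1 if the arrays are identical.
    // If one array is a prefix of the other, the length of the shorter array is returned.
    public static int firstDifference(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            if (a[i] != b[i]) {
                return i;
            }
        }
        return a.length == b.length ? -1 : n;
    }

    // Formats one byte as exactly eight binary digits.
    public static String toBits(byte b) {
        String s = Integer.toBinaryString(b & 0xFF);
        StringBuilder padded = new StringBuilder();
        for (int i = s.length(); i < 8; i++) {
            padded.append('0');
        }
        return padded.append(s).toString();
    }

    public static void main(String[] args) {
        // Hypothetical sample data standing in for the bytes before writing and after reading.
        byte[] written = {0x1F, (byte) 0x8B, 0x33};
        byte[] readBack = {0x1F, (byte) 0x8B, 0x2A};
        int i = firstDifference(written, readBack);
        if (i >= 0 && i < written.length && i < readBack.length) {
            System.out.println("offset " + i + ": " + toBits(written[i]) + " vs " + toBits(readBack[i]));
        }
    }
}
```

Running the comparison on the bytes handed to the writer and the bytes returned by the reader shows whether the corruption happens on the way to disk or later, during decompression or parsing.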

When I make the name xml longer, like xmldm, it is parsed correctly as xmldm. So far I have only seen the problem occur on names with three characters.

Directly compressing and decompressing the generated string (without saving it to a file and reading that) does work, so maybe the bug is caused by the file encoding.

As far as I know, the string output is in ISO 8859-1 format, but I couldn't get the file encoding right. When a file is read, its raw bytes are read and all of them are decoded as ISO 8859-1 characters.

I have a few possible causes in mind, but I don't know how to test them:

  • The GZip output has a different encoding than the uncompressed data, causing small differences when it is stored as a file.
  • The file is stored as UTF-8, ignoring the request to use ISO 8859-1 encoding (I don't know how to explain it better :) ).
  • There is a small bug in the Java GZip libraries.

Which one is true? And if none of them is right, what is the real cause of this bug?

I haven't been able to figure it out so far.
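The encoding hypothesis can be tested without touching the disk: turn every possible byte value into a String and back with a given charset, and check whether the bytes survive. This is a minimal sketch; the class CharsetRoundTrip and its methods are hypothetical names used only for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class CharsetRoundTrip {

    // Returns true when bytes -> String -> bytes is lossless for the given charset.
    public static boolean roundTrips(byte[] original, Charset charset) {
        byte[] back = new String(original, charset).getBytes(charset);
        return Arrays.equals(original, back);
    }

    // Builds an array holding every byte value from 0x00 to 0xFF once.
    public static byte[] allByteValues() {
        byte[] all = new byte[256];
        for (int i = 0; i < all.length; i++) {
            all[i] = (byte) i;
        }
        return all;
    }

    public static void main(String[] args) {
        byte[] all = allByteValues();
        // ISO-8859-1 maps each byte to exactly one character, so nothing is lost.
        System.out.println(roundTrips(all, StandardCharsets.ISO_8859_1)); // prints true
        // Bytes 0x80..0xFF on their own are not valid UTF-8 and get replaced while decoding.
        System.out.println(roundTrips(all, StandardCharsets.UTF_8));      // prints false
    }
}
```

Compressed data uses the whole byte range, while mostly-ASCII payloads stay below 0x80, which is one reason an encoding mismatch can remain hidden until compression is switched on.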

The MBDFFile class, reading and storing the files:

/* MBDFFile.java */
package com.redgalaxy.mbdf;

import java.io.*;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class MBDFFile {
    public static MBDF readMBDFFromFile(String filename) throws IOException {

//        FileInputStream is = new FileInputStream(filename);
//        InputStreamReader isr = new InputStreamReader(is, "ISO-8859-1");
//        BufferedReader br = new BufferedReader(isr);
//
//        StringBuilder builder = new StringBuilder();
//
//        String currentLine;
//
//        while ((currentLine = br.readLine()) != null) {
//            builder.append(currentLine);
//            builder.append("\n");
//        }
//
//        builder.deleteCharAt(builder.length() - 1);
//
//
//        br.close();

        Path path = Paths.get(filename);
        byte[] data = Files.readAllBytes(path);

        return new MBDF(new String(data, "ISO-8859-1"));
    }

    private static void writeToFile(String filename, byte[] txt) throws IOException {
//        BufferedWriter writer = new BufferedWriter(new FileWriter(filename));
////        FileWriter writer = new FileWriter(filename);
//        writer.write(txt.getBytes("ISO-8859-1"));
//        writer.close();
//        PrintWriter pw = new PrintWriter(filename, "ISO-8859-1");
        FileOutputStream stream = new FileOutputStream(filename);
        stream.write(txt);
        stream.close();
    }

    public static void writeMBDFToFile(String filename, MBDF info) throws IOException {
        writeToFile(filename, info.pack().getBytes("ISO-8859-1"));
    }
}

The pack function generates the final string for the file, in ISO 8859-1 format.

For all the other code, see my MBDF GitHub repository.

I commented out the code from my earlier attempts to show what I tried.

My workspace:

  • MacBook Air '11 (High Sierra)
  • IntelliJ Community 2017.3
  • JDK 1.8

I hope this is enough information. This is the only way I could make clear what I'm doing and what exactly isn't working.


Edit: MBDF.java

/* MBDF.java */
package com.redgalaxy.mbdf;

import java.io.IOException;
import java.io.UnsupportedEncodingException;

public class MBDF {

    private String data;
    private InfoTag tag;

    public MBDF(String data) {
        this.tag = new InfoTag((byte) data.charAt(0));
        this.data = data.substring(1);
    }

    public MBDF(String data, InfoTag tag) {
        this.tag = tag;
        this.data = data;
    }

    public MBDFBinary getBinaryObject() throws IOException {
        String uncompressed = data;
        if (tag.isCompressed) {
            uncompressed = GZipUtils.decompress(data);
        }
        Binary binary = getBinaryFrom8Bit(uncompressed);
        return new MBDFBinary(binary.subBit(0, binary.getLen() - tag.trailing));
    }

    public static Binary getBinaryFrom8Bit(String s8bit) {
        try {
            byte[] bytes = s8bit.getBytes("ISO-8859-1");
            return new Binary(bytes, bytes.length * 8);
        } catch( UnsupportedEncodingException ignored ) {
            // This is not gonna happen because encoding 'ISO-8859-1' is always supported.
            return new Binary(new byte[0], 0);
        }
    }

    public static String get8BitFromBinary(Binary binary) {
        try {
            return new String(binary.getByteArray(), "ISO-8859-1");
        } catch( UnsupportedEncodingException ignored ) {
            // This is not gonna happen because encoding 'ISO-8859-1' is always supported.
            return "";
        }
    }

    /*
     * Adds leading zeroes to the binary string, so that the final amount of bits is 16 (or 8 when is16 is false)
     */
    private static String addLeadingZeroes(String bin, boolean is16) {
        int len = bin.length();
        long amount = (long) (is16 ? 16 : 8) - len;

        // Create zeroes and append binary string
        StringBuilder zeroes = new StringBuilder();
        for( int i = 0; i < amount; i ++ ) {
            zeroes.append(0);
        }
        zeroes.append(bin);

        return zeroes.toString();
    }

    public String pack(){
        return tag.getFilePrefixChar() + data;
    }

    public String getData() {
        return data;
    }

    public InfoTag getTag() {
        return tag;
    }

}

This class contains the pack() method. The data field is already compressed at this point (if compression is enabled).

For the other classes, please see the GitHub repository; I don't want to make my question too long.

Solved it by myself!

It turned out to be the reading and writing system. When I exported a file, I built a string by using the ISO-8859-1 table to turn bytes into characters. I then wrote that string to a text file, which is encoded as UTF-8. The main problem was that I used FileWriter instances to write it, and those are meant for text files.

Reading used the inverse process. The complete file was read into memory as a string (memory-consuming!) and then decoded.

I didn't realize that a file is just binary data, and that text is one particular interpretation of it. ISO-8859-1 and UTF-8 are two such text encodings. I had problems with UTF-8 because it split some characters into two bytes, which I couldn't handle.
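The effect can be reproduced in memory. The sketch below pushes the same string through a character Writer and through a plain byte conversion, then compares the lengths. It assumes the text writer encodes as UTF-8 (FileWriter uses the platform default charset, so an explicit OutputStreamWriter stands in for it here); the class WriterVsStream is a hypothetical name.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

class WriterVsStream {

    // Writes the string as text, encoded as UTF-8 (assumed stand-in for a text-file writer).
    public static byte[] viaWriter(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            writer.write(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Converts the string back to the exact bytes it was built from.
    public static byte[] viaStream(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // The two GZip magic bytes 0x1F 0x8B, held as ISO-8859-1 characters.
        String header = "\u001F\u008B";
        System.out.println(viaStream(header).length); // prints 2
        System.out.println(viaWriter(header).length); // prints 3, because 0x8B becomes 0xC2 0x8B
    }
}
```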

My solution was to use streams. Java has FileInputStream and FileOutputStream, which are meant for reading and writing binary files. I hadn't used them because I thought there was no big difference ("files are text, so what's the problem?"), but there is. I implemented this by writing a new, similar library, and I can now pass any input stream to the decoder and any output stream to the encoder. To make uncompressed files, pass a FileOutputStream. GZipped files can use a GZIPOutputStream wrapping a FileOutputStream. If someone wants a string with the binary data, a ByteArrayOutputStream can be used. The same rules apply to reading, with the InputStream variant of each stream.
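A minimal sketch of that stream-based approach follows. It is not the actual MBDF library code: the class GzipFileRoundTrip, its method names, and the temp-file prefix are hypothetical, and the payload is treated as an opaque byte array that never passes through a String.

```java
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipFileRoundTrip {

    // Compresses the payload straight into the file; no character encoding is involved.
    public static void writeGzipped(Path file, byte[] payload) {
        try (OutputStream out = new GZIPOutputStream(new FileOutputStream(file.toFile()))) {
            out.write(payload);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads and decompresses the whole file into a byte array.
    public static byte[] readGzipped(Path file) {
        try (InputStream in = new GZIPInputStream(new FileInputStream(file.toFile()))) {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes the payload to a temporary file, reads it back, and removes the file again.
    public static byte[] roundTripThroughTempFile(byte[] payload) {
        try {
            Path tmp = Files.createTempFile("mbdf-sketch", ".gz");
            try {
                writeGzipped(tmp, payload);
                return readGzipped(tmp);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Wrapping IOException in UncheckedIOException is only a convenience for this sketch; a library would more likely declare throws IOException, as the MBDFFile methods do.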

No more UTF-8 or ISO-8859-1 problems, and it works, even with GZip!
