
Storing plain text and byte information in the same file - Conversion problems

I am supposed to develop a subsystem that stores certain business data in a file, and I am running into a problem. First, some requirements I have:

  • It has to be 1 file for the entire data.
  • The data contains both plain text which should be human readable and byte data.
  • The byte data could be huge (and will grow in the future), so I should keep it as small as possible.

I thought I would just put everything into a String, encode it with UTF-8 (a format that will not go away any time soon) and write it to a file. The problem is that UTF-8 does not allow certain byte sequences and changes them, so the data is different when I later read the file.

Here is sample code that shows the problem:

    // The charset we use to encode the strings / file
    Charset charSet = StandardCharsets.UTF_8;

    // The byte data we want to store (as ints here because in the app it is used as ints)
    int[] idsToStore = new int[] {360, 361, 390, 391};

    // We transform our ints to bytes
    byte[] bytesToStore = new byte[idsToStore.length * 4];
    for (int i = 0; i < idsToStore.length; i++) {
        int id = idsToStore[i];
        bytesToStore[i * 4 + 0] = (byte) ((id >> 24) & 0xFF);
        bytesToStore[i * 4 + 1] = (byte) ((id >> 16) & 0xFF);
        bytesToStore[i * 4 + 2] = (byte) ((id >> 8) & 0xFF);
        bytesToStore[i * 4 + 3] = (byte) (id & 0xFF);
    }
    // We transform our bytes to a String
    String stringToStore = new String(bytesToStore, charSet);

    System.out.println("idsToStore="+Arrays.toString(idsToStore));
    System.out.println("BytesToStore="+Arrays.toString(bytesToStore));
    System.out.println("StringToStore="+stringToStore);
    System.out.println();

    // We load our bytes from the "file" (in this case a String, but it's the same result)
    byte[] bytesLoaded = stringToStore.getBytes(charSet);
    // Just to check we see if the resulting String is identical
    String stringLoaded = new String(bytesLoaded, charSet);

    // We transform our bytes back to ints
    int[] idsLoaded = new int[bytesLoaded.length / 4];
    int readPos = 0;
    for (int i = 0; i < idsLoaded.length; i++) {
        byte b1 = bytesLoaded[readPos++];
        byte b2 = bytesLoaded[readPos++];
        byte b3 = bytesLoaded[readPos++];
        byte b4 = bytesLoaded[readPos++];
        idsLoaded[i] = (b4 & 0xFF) | (b3 & 0xFF) << 8 | (b2 & 0xFF) << 16 | (b1 & 0xFF) << 24;
    }

    System.out.println("BytesLoaded="+Arrays.toString(bytesLoaded));
    System.out.println("StringLoaded="+stringLoaded);
    System.out.println("idsLoaded="+Arrays.toString(idsLoaded));
    System.out.println();

    // We check everything
    System.out.println("Bytes equal: "+Arrays.equals(bytesToStore, bytesLoaded));
    System.out.println("Strings equal: "+stringToStore.equals(stringLoaded));
    System.out.println("IDs equal: "+Arrays.equals(idsToStore, idsLoaded));

The output with UTF-8 is:

    idsToStore=[360, 361, 390, 391]
    BytesToStore=[0, 0, 1, 104, 0, 0, 1, 105, 0, 0, 1, -122, 0, 0, 1, -121]
    StringToStore=(can not be pasted into SO)

    BytesLoaded=[0, 0, 1, 104, 0, 0, 1, 105, 0, 0, 1, -17, -65, -67, 0, 0, 1, -17, -65, -67]
    StringLoaded=(can not be pasted into SO)
    idsLoaded=[360, 361, 495, -1078132736, 32489405]

    Bytes equal: false
    Strings equal: true
    IDs equal: false

If I change the Charset to UTF-16BE (BE = big-endian), this test works! The problem is that I am not sure whether UTF-16BE only works for this test "by chance". I need to know whether it will always work or not. Or perhaps there is a better way.

I am thankful for any recommendations. Thanks in advance.

The only way to be sure that your charset will always work is to test it with every possible byte value: write an array of bytes containing all 256 possible values, and check that it is read back unchanged.
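A minimal sketch of such a check (charsetToTest and the other variable names are mine; note that it only exercises each byte value once, in sequence, so it does not cover every possible multi-byte combination):

    // Round-trip check: do all 256 byte values survive a String round trip?
    Charset charsetToTest = StandardCharsets.UTF_8; // also try UTF_16BE, ISO_8859_1, ...
    byte[] allBytes = new byte[256];
    for (int i = 0; i < 256; i++) {
        allBytes[i] = (byte) i;
    }
    byte[] roundTripped = new String(allBytes, charsetToTest).getBytes(charsetToTest);
    System.out.println(charsetToTest + " round trips: " + Arrays.equals(allBytes, roundTripped));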

But, going back to the root of the problem, I doubt that encoding all the data into a String will work well. String is a Unicode structure, oriented towards readable text (i.e. it might not be able to carry some characters below ASCII code 32, and arbitrary byte sequences are not necessarily valid text at all).
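That is exactly what happens in your output: a lone 0x86 byte (-122) is not a valid UTF-8 sequence, so the decoder replaces it with the replacement character U+FFFD, which re-encodes as the three bytes EF BF BD (-17, -65, -67). A minimal snippet that reproduces it:

    // 0x01 is valid ASCII, but a lone 0x86 is not valid UTF-8,
    // so decoding replaces it with U+FFFD (the replacement character)
    byte[] original = new byte[] {0x01, (byte) 0x86};
    String decoded = new String(original, StandardCharsets.UTF_8);
    byte[] reencoded = decoded.getBytes(StandardCharsets.UTF_8);
    System.out.println(Arrays.toString(reencoded)); // prints [1, -17, -65, -67]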

Instead, I would think of a BINARY structured file: being binary, you ensure that it can contain anything transparently, and being structured, you ensure that you can store several kinds of data in it. For example, you could design a structure made of segments, each segment having a header that states the length of its data. The binary segments would be read through an InputStream, and the text segments through a Reader (with the desired encoding).
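As an illustration only, here is a sketch of what such a segmented file could look like (the layout of one type byte plus a four-byte length per segment, the file name business.dat and the sample values are my own assumptions, not a fixed format):

    import java.io.BufferedInputStream;
    import java.io.BufferedOutputStream;
    import java.io.DataInputStream;
    import java.io.DataOutputStream;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;

    // Hypothetical segment layout: [1 byte type][4 byte length][payload bytes]
    // Type 1 = UTF-8 text segment, type 2 = raw binary segment
    public class SegmentedFileDemo {

        public static void main(String[] args) throws IOException {
            // Writing: each segment is prefixed with its type and its length
            try (DataOutputStream out = new DataOutputStream(
                    new BufferedOutputStream(new FileOutputStream("business.dat")))) {
                byte[] text = "Some human readable description".getBytes(StandardCharsets.UTF_8);
                out.writeByte(1);
                out.writeInt(text.length);
                out.write(text);

                int[] ids = {360, 361, 390, 391};
                out.writeByte(2);
                out.writeInt(ids.length * 4);
                for (int id : ids) {
                    out.writeInt(id); // always big-endian, 4 bytes, no charset involved
                }
            }

            // Reading: the length header tells us how many payload bytes to read
            try (DataInputStream in = new DataInputStream(
                    new BufferedInputStream(new FileInputStream("business.dat")))) {
                while (in.available() > 0) {
                    int type = in.readUnsignedByte();
                    int length = in.readInt();
                    byte[] payload = new byte[length];
                    in.readFully(payload);
                    if (type == 1) {
                        System.out.println("Text: " + new String(payload, StandardCharsets.UTF_8));
                    } else {
                        System.out.println("Binary segment of " + length + " bytes");
                    }
                }
            }
        }
    }

Because DataOutputStream.writeInt always writes four big-endian bytes, the binary segments stay compact and never go through a charset; only the text segments are encoded and decoded, and the length headers let a reader skip segments it does not care about. (Checking available() is just a simple end-of-file test for a local file; a real format might store a segment count up front instead.)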
