Storing plain text and byte data in the same file: a conversion problem

Question

I am supposed to develop a subsystem that stores certain business data in a file, and I am running into a problem. First, the requirements I have:

It has to be a single file for all of the data.
The data contains both plain text, which should be human-readable, and byte data.
The byte data could be huge (and will grow in the future), so I should keep it small if possible.

My idea was to put everything in a String, encode it as UTF-8 (a format that will not go away any time soon), and write it to a file. The problem is that UTF-8 does not allow certain byte combinations, so they are changed when I later read the file back.

Here is sample code that shows the problem:

    // The charset we use to encode the strings / file
    Charset charSet = StandardCharsets.UTF_8;

    // The byte data we want to store (as ints here because in the app it is used as ints)
    int[] idsToStore = {360, 361, 390, 391};

    // We transform our ints to bytes
    byte[] bytesToStore = new byte[idsToStore.length * 4];
    for (int i = 0; i < idsToStore.length; i++) {
        int id = idsToStore[i];
        bytesToStore[i * 4 + 0] = (byte) ((id >> 24) & 0xFF);
        bytesToStore[i * 4 + 1] = (byte) ((id >> 16) & 0xFF);
        bytesToStore[i * 4 + 2] = (byte) ((id >> 8) & 0xFF);
        bytesToStore[i * 4 + 3] = (byte) (id & 0xFF);
    }
    // We transform our bytes to a String
    String stringToStore = new String(bytesToStore, charSet);

    System.out.println("idsToStore="+Arrays.toString(idsToStore));
    System.out.println("BytesToStore="+Arrays.toString(bytesToStore));
    System.out.println("StringToStore="+stringToStore);
    System.out.println();

    // We load our bytes from the "file" (in this case a String, but it's the same result)
    byte[] bytesLoaded = stringToStore.getBytes(charSet);
    // Just to check we see if the resulting String is identical
    String stringLoaded = new String(bytesLoaded, charSet);

    // We transform our bytes back to ints
    int[] idsLoaded = new int[bytesLoaded.length / 4];
    int readPos = 0;
    for (int i = 0; i < idsLoaded.length; i++) {
        byte b1 = bytesLoaded[readPos++];
        byte b2 = bytesLoaded[readPos++];
        byte b3 = bytesLoaded[readPos++];
        byte b4 = bytesLoaded[readPos++];
        idsLoaded[i] = (b4 & 0xFF) | (b3 & 0xFF) << 8 | (b2 & 0xFF) << 16 | (b1 & 0xFF) << 24;
    }

    System.out.println("BytesLoaded="+Arrays.toString(bytesLoaded));
    System.out.println("StringLoaded="+stringLoaded);
    System.out.println("idsLoaded="+Arrays.toString(idsLoaded));
    System.out.println();

    // We check everything
    System.out.println("Bytes equal: "+Arrays.equals(bytesToStore, bytesLoaded));
    System.out.println("Strings equal: "+stringToStore.equals(stringLoaded));
    System.out.println("IDs equal: "+Arrays.equals(idsToStore, idsLoaded));

The output with UTF-8 is:

    idsToStore=[360, 361, 390, 391]
    BytesToStore=[0, 0, 1, 104, 0, 0, 1, 105, 0, 0, 1, -122, 0, 0, 1, -121]
    StringToStore=(can not be pasted into SO)

    idsLoaded=[360, 361, 495, -1078132736, 32489405]
    BytesLoaded=[0, 0, 1, 104, 0, 0, 1, 105, 0, 0, 1, -17, -65, -67, 0, 0, 1, -17, -65, -67]
    StringLoaded=(can not be pasted into SO)

    Bytes equal: false
    Strings equal: true
    IDs equal: false

If I change the Charset to UTF-16BE (BE is big-endian), this test works! The problem is that I am not sure whether UTF-16BE only works for this test by chance. I need to know whether it will always work. Or perhaps there is a better way.

I am thankful for any recommendations. Thanks in advance.

Answer 1

One way to check whether your charset preserves arbitrary bytes is to test it with every possible byte value: write an array of bytes containing all 256 possible values (0x00 to 0xFF, not only the 128 ASCII codes) and test whether it is read back correctly. For multi-byte charsets such as UTF-16, sequences of bytes matter as well, so passing a single test is not a guarantee.
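A minimal sketch of such a round-trip test, in Java as in the question. The class name `RoundTripCheck` and the helpers `roundTrips` and `allByteValues` are hypothetical names chosen for illustration. The last check feeds UTF-16BE a lone surrogate, which is one kind of input the question's test data did not contain.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class RoundTripCheck {

    // True if decoding the bytes to a String and encoding that String again
    // yields exactly the original bytes.
    public static boolean roundTrips(Charset charset, byte[] data) {
        byte[] reloaded = new String(data, charset).getBytes(charset);
        return Arrays.equals(data, reloaded);
    }

    // All 256 possible byte values, 0x00 through 0xFF.
    public static byte[] allByteValues() {
        byte[] all = new byte[256];
        for (int i = 0; i < all.length; i++) {
            all[i] = (byte) i;
        }
        return all;
    }

    public static void main(String[] args) {
        byte[] all = allByteValues();
        System.out.println("UTF-8:      " + roundTrips(StandardCharsets.UTF_8, all));
        System.out.println("ISO-8859-1: " + roundTrips(StandardCharsets.ISO_8859_1, all));

        // The code unit 0xD800 is a high surrogate; followed by 'A' instead of
        // a low surrogate it is malformed UTF-16 and gets replaced on decoding.
        byte[] loneSurrogate = {(byte) 0xD8, 0x00, 0x00, 0x41};
        System.out.println("UTF-16BE:   " + roundTrips(StandardCharsets.UTF_16BE, loneSurrogate));
    }
}
```

A single-byte charset such as ISO-8859-1 maps each byte to exactly one character, which is why it can survive this test, but the resulting "text" is no more human-readable than the raw bytes.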

But, going back to the root of the problem, I doubt that encoding all the data into a string will work well. A String is a Unicode structure, designed to hold readable text: byte sequences that are not valid in the chosen charset are not guaranteed to survive being decoded into one.

Instead, I would think of a BINARY structured file. Being binary, it can contain anything transparently. Being structured, it can store several kinds of data. For example, you could design a structure made of segments, with each segment having a header that gives the length of its data. The binary segments would be read through an InputStream, and the text segments through a Reader (with the desired encoding).
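The segment idea could be sketched roughly as follows. This is an illustration under assumptions, not a format the answer specifies: the one-byte type tags `TYPE_TEXT` and `TYPE_BINARY`, the 4-byte length header, the class name `SegmentedFileSketch`, and the sample text are all made up here. For brevity the text payload is decoded with `new String(bytes, UTF_8)`; wrapping the payload in a Reader would work as well.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class SegmentedFileSketch {
    // Hypothetical segment type tags.
    public static final byte TYPE_TEXT = 1;
    public static final byte TYPE_BINARY = 2;

    // Segment layout (assumed): 1 type byte, 4-byte big-endian length, payload.
    public static void writeSegment(DataOutputStream out, byte type, byte[] payload)
            throws IOException {
        out.writeByte(type);
        out.writeInt(payload.length);
        out.write(payload);
    }

    // Reads one segment and returns its payload, checking the type tag.
    public static byte[] readSegment(DataInputStream in, byte expectedType)
            throws IOException {
        byte type = in.readByte();
        if (type != expectedType) {
            throw new IOException("Unexpected segment type " + type);
        }
        byte[] payload = new byte[in.readInt()];
        in.readFully(payload);
        return payload;
    }

    // Packs ints as 4 big-endian bytes each, like the loop in the question.
    public static byte[] toBytes(int[] ids) {
        ByteBuffer buffer = ByteBuffer.allocate(ids.length * 4);
        buffer.asIntBuffer().put(ids);
        return buffer.array();
    }

    public static int[] toInts(byte[] bytes) {
        int[] ids = new int[bytes.length / 4];
        ByteBuffer.wrap(bytes).asIntBuffer().get(ids);
        return ids;
    }

    public static void main(String[] args) throws IOException {
        int[] ids = {360, 361, 390, 391};

        // An in-memory stream stands in for the file here.
        ByteArrayOutputStream file = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(file)) {
            writeSegment(out, TYPE_TEXT, "sample text".getBytes(StandardCharsets.UTF_8));
            writeSegment(out, TYPE_BINARY, toBytes(ids));
        }

        try (DataInputStream in =
                new DataInputStream(new ByteArrayInputStream(file.toByteArray()))) {
            String text = new String(readSegment(in, TYPE_TEXT), StandardCharsets.UTF_8);
            int[] loaded = toInts(readSegment(in, TYPE_BINARY));
            System.out.println("Text: " + text);
            System.out.println("IDs equal: " + Arrays.equals(ids, loaded));
        }
    }
}
```

Because the binary payload never passes through a String, no charset can alter it, while the text segment still appears as readable UTF-8 when the file is opened in an editor.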

Answer 1
2 2015-08-09 16:36:37
