
Storing plain text and byte information in the same file - Conversion problems

I am supposed to develop a subsystem that stores certain business data in a file, and I am running into a problem. First, the requirements:

  • All of the data has to go into a single file.
  • The data contains both plain text, which should be human-readable, and byte data.
  • The byte data could be huge (and will grow in the future), so I should keep it as small as possible.

My idea was to put everything in a String, encode it with UTF-8 (a format that will not go away any time soon) and write it to a file. The problem is that UTF-8 does not allow certain byte combinations and changes them when I later read the file back.

Here is sample code that shows the problem:

    // Needs: java.nio.charset.Charset, java.nio.charset.StandardCharsets, java.util.Arrays
    // The charset we use to encode the strings / file
    Charset charSet = StandardCharsets.UTF_8;

    // The byte data we want to store (as ints here because the app uses them as ints)
    int[] idsToStore = {360, 361, 390, 391};

    // We transform our ints to bytes
    byte[] bytesToStore = new byte[idsToStore.length * 4];
    for (int i = 0; i < idsToStore.length; i++) {
        int id = idsToStore[i];
        bytesToStore[i * 4 + 0] = (byte) ((id >> 24) & 0xFF);
        bytesToStore[i * 4 + 1] = (byte) ((id >> 16) & 0xFF);
        bytesToStore[i * 4 + 2] = (byte) ((id >> 8) & 0xFF);
        bytesToStore[i * 4 + 3] = (byte) (id & 0xFF);
    }
    // We transform our bytes to a String
    String stringToStore = new String(bytesToStore, charSet);

    System.out.println("idsToStore="+Arrays.toString(idsToStore));
    System.out.println("BytesToStore="+Arrays.toString(bytesToStore));
    System.out.println("StringToStore="+stringToStore);
    System.out.println();

    // We load our bytes from the "file" (in this case a String, but it's the same result)
    byte[] bytesLoaded = stringToStore.getBytes(charSet);
    // Just to check we see if the resulting String is identical
    String stringLoaded = new String(bytesLoaded, charSet);

    // We transform our bytes back to ints
    int[] idsLoaded = new int[bytesLoaded.length / 4];
    int readPos = 0;
    for (int i = 0; i < idsLoaded.length; i++) {
        byte b1 = bytesLoaded[readPos++];
        byte b2 = bytesLoaded[readPos++];
        byte b3 = bytesLoaded[readPos++];
        byte b4 = bytesLoaded[readPos++];
        idsLoaded[i] = (b4 & 0xFF) | (b3 & 0xFF) << 8 | (b2 & 0xFF) << 16 | (b1 & 0xFF) << 24;
    }

    System.out.println("BytesLoaded="+Arrays.toString(bytesLoaded));
    System.out.println("StringLoaded="+stringLoaded);
    System.out.println("idsLoaded="+Arrays.toString(idsLoaded));
    System.out.println();

    // We check everything
    System.out.println("Bytes equal: "+Arrays.equals(bytesToStore, bytesLoaded));
    System.out.println("Strings equal: "+stringToStore.equals(stringLoaded));
    System.out.println("IDs equal: "+Arrays.equals(idsToStore, idsLoaded));

The output with UTF-8 is:

    idsToStore=[360, 361, 390, 391]
    BytesToStore=[0, 0, 1, 104, 0, 0, 1, 105, 0, 0, 1, -122, 0, 0, 1, -121]
    StringToStore=(can not be pasted into SO)

    idsLoaded=[360, 361, 495, -1078132736, 32489405]
    BytesLoaded=[0, 0, 1, 104, 0, 0, 1, 105, 0, 0, 1, -17, -65, -67, 0, 0, 1, -17, -65, -67]
    StringLoaded=(can not be pasted into SO)

    Bytes equal: false
    Strings equal: true
    IDs equal: false

If I change the Charset to UTF-16BE (BE means big-endian), this test works! The problem is that I am not sure whether UTF-16BE works for this test only "by chance". I need to know whether it will always work. Or perhaps there is a better way.

I am thankful for any recommendations. Thanks in advance.

One way to check whether your charset will always work is to test it against every possible byte value: write an array of bytes containing all 256 possible values, and check that it is read back correctly. For multi-byte encodings such as UTF-8 or UTF-16, the test should also cover byte combinations, because validity there depends on sequences rather than on single bytes.
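
Such a round-trip check could be sketched roughly as follows. The class name `CharsetRoundTrip` and its helpers are hypothetical names chosen for illustration, not part of the question's code. Besides all 256 byte values, the sketch also tries the question's own bytes and the id 56320 (0xDC00), picked here as an assumption because its low two bytes form an unpaired surrogate in UTF-16BE.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CharsetRoundTrip {

    // Hypothetical helper: true if decoding the bytes to a String and
    // encoding that String again yields exactly the same bytes.
    static boolean roundTrips(byte[] data, Charset charset) {
        String decoded = new String(data, charset);
        return Arrays.equals(data, decoded.getBytes(charset));
    }

    // All 256 possible byte values, in ascending order.
    static byte[] allByteValues() {
        byte[] all = new byte[256];
        for (int i = 0; i < all.length; i++) {
            all[i] = (byte) i;
        }
        return all;
    }

    public static void main(String[] args) {
        byte[] all = allByteValues();

        // The bytes from the question (ids 360, 361, 390, 391, big-endian).
        byte[] questionBytes = {0, 0, 1, 104, 0, 0, 1, 105, 0, 0, 1, -122, 0, 0, 1, -121};

        // The id 56320 (0x0000DC00) as 4 big-endian bytes. In UTF-16BE the
        // pair DC 00 is a low surrogate without a preceding high surrogate.
        byte[] loneSurrogate = {0x00, 0x00, (byte) 0xDC, 0x00};

        // false: bytes 0x80..0xFF in a row are not valid UTF-8
        System.out.println("UTF-8, all byte values: "
                + roundTrips(all, StandardCharsets.UTF_8));
        // true: every 2-byte unit here happens to be a valid code unit
        System.out.println("UTF-16BE, question bytes: "
                + roundTrips(questionBytes, StandardCharsets.UTF_16BE));
        // false: the unpaired surrogate is replaced while decoding
        System.out.println("UTF-16BE, id 56320: "
                + roundTrips(loneSurrogate, StandardCharsets.UTF_16BE));
    }
}
```

If this reasoning holds, UTF-16BE passes the question's test only because the sample ids are small; an id whose 16-bit halves fall in the surrogate range 0xD800 to 0xDFFF would not survive the trip through a String.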

But going back to the root of the problem, I doubt that encoding all the data into a string will work well. String is a Unicode structure designed to hold readable text (i.e. it might not cope with some characters below ASCII code 32).

Instead, I would consider a structured BINARY file. Being binary, it can contain anything transparently. And being structured, it can store several kinds of data. For example, you could design a structure made of segments, each with a header giving the length of its data. The binary segments would be read through an InputStream, and the text segments through a Reader (with the desired encoding).
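
A minimal sketch of such a segmented layout, under several assumptions: each segment is a 1-byte type tag, a 4-byte big-endian length and then the payload; the names `SegmentedFileSketch`, `TYPE_TEXT`, `TYPE_BINARY`, `writeSegment` and `readSegment` are made up for illustration; and the sample text is a placeholder. For brevity it decodes a text segment with `new String(bytes, UTF_8)` instead of a Reader, and writes to an in-memory buffer where a real implementation would use a FileOutputStream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class SegmentedFileSketch {

    // Hypothetical type tags stored in front of each segment.
    static final byte TYPE_TEXT = 'T';
    static final byte TYPE_BINARY = 'B';

    // Writes one segment: type tag, payload length, then the raw payload.
    static void writeSegment(DataOutputStream out, byte type, byte[] payload) throws IOException {
        out.writeByte(type);
        out.writeInt(payload.length);
        out.write(payload);
    }

    // Reads one segment and returns its payload; fails if the tag is unexpected.
    static byte[] readSegment(DataInputStream in, byte expectedType) throws IOException {
        byte type = in.readByte();
        if (type != expectedType) {
            throw new IOException("Unexpected segment type: " + (char) type);
        }
        byte[] payload = new byte[in.readInt()];
        in.readFully(payload);
        return payload;
    }

    // Packs ints as 4 big-endian bytes each (ByteBuffer's default byte order).
    static byte[] intsToBytes(int[] ids) {
        ByteBuffer buffer = ByteBuffer.allocate(ids.length * Integer.BYTES);
        for (int id : ids) {
            buffer.putInt(id);
        }
        return buffer.array();
    }

    // Reverses intsToBytes.
    static int[] bytesToInts(byte[] bytes) {
        ByteBuffer buffer = ByteBuffer.wrap(bytes);
        int[] ids = new int[bytes.length / Integer.BYTES];
        for (int i = 0; i < ids.length; i++) {
            ids[i] = buffer.getInt();
        }
        return ids;
    }

    public static void main(String[] args) throws IOException {
        int[] ids = {360, 361, 390, 391};
        String text = "Placeholder text segment"; // sample data, not from the question

        // Stand-in for the file; the bytes never pass through a String.
        ByteArrayOutputStream file = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(file)) {
            writeSegment(out, TYPE_TEXT, text.getBytes(StandardCharsets.UTF_8));
            writeSegment(out, TYPE_BINARY, intsToBytes(ids));
        }

        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(file.toByteArray()))) {
            String textLoaded = new String(readSegment(in, TYPE_TEXT), StandardCharsets.UTF_8);
            int[] idsLoaded = bytesToInts(readSegment(in, TYPE_BINARY));
            System.out.println("Text equal: " + text.equals(textLoaded));
            System.out.println("IDs equal: " + Arrays.equals(ids, idsLoaded));
        }
    }
}
```

Because the text payload is stored as plain UTF-8 bytes, it stays readable when the file is opened in a text editor, while the binary payload is never decoded as characters and so cannot be altered by a charset.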
