Reading/writing undefined Unicode characters to a file

Question

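Short version:

Writing a string that contains characters with values between 55296 and 57343 (inclusive) to a file causes those characters to be replaced with question marks ('?'), which means the resulting file cannot be read back into the original string.

How can I avoid or work around this?

Long version:

I have an array of integers between 0 and 65535 (inclusive) that I am trying to write to a file so that the array can be retrieved later. An example array: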

int[] integerArray = new int[] {2404,44698,55597,17382,35641,10988};

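On the understanding that Unicode has exactly 65536 characters, and that Java's char data type also has 65536 values, I decided to cast each of these integers to a char, append them to a string, and then write that string to a file, like so: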

private static void writeUnicodeToFile(int[] array, String path) {
    String unicode = "";
    for(int integer : array) unicode += (char) integer;
    try { Files.write(Path.of(path), unicode.getBytes());
    } catch(IOException e) { e.printStackTrace(); }
}

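This works well in most cases, but with the example array the value 55597 comes back as a question mark ('?') character, so when I try to retrieve the values like this: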

private static int[] getUnicodeFromFile(String path) {
    String unicode = "";
    try { unicode = Files.readString(Path.of(path));
    } catch(IOException e) { e.printStackTrace(); }
    int[] integerArray = new int[unicode.length()];
    for(int i = 0; i < unicode.length(); i++) integerArray[i] = (int) unicode.charAt(i);
    return integerArray;
}

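the returned array contains 63 (the Unicode value of the question mark) at index 2 instead of 55597.

Looking up the value 55597 in a Unicode character table shows that it is not a valid character, and further experimentation shows that there are 2049 char values that end up in the file as a question mark (the 2048 values from 55296 to 57343, plus the question mark itself). All other characters are written and read back correctly.

However, this means the file cannot be decoded back into the integer array, because a single question mark could stand for any of 2048 other values.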
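For context, 55296 to 57343 (0xD800 to 0xDFFF) is the UTF-16 surrogate range: those values only have meaning as halves of a surrogate pair and are not characters on their own, which is why a lone one cannot be encoded. The following is a minimal sketch of the effect, assuming UTF-8 is the charset in use; the class name `SurrogateDemo` and the helper `encode` are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

public class SurrogateDemo {
    // Encodes a single char as UTF-8 and returns the resulting bytes.
    static byte[] encode(char c) {
        return String.valueOf(c).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        char lone = (char) 55597;  // 0xD92D, a high surrogate with no partner
        char normal = (char) 2404; // 0x0964, an ordinary character

        // A lone surrogate is replaced by the encoder's replacement byte, '?'.
        System.out.println(Character.isSurrogate(lone) + " -> first byte " + encode(lone)[0]);
        // An ordinary character is encoded as a normal multi-byte UTF-8 sequence.
        System.out.println(Character.isSurrogate(normal) + " -> " + encode(normal).length + " bytes");
    }
}
```

Because every lone surrogate collapses to the same byte, the information is lost at write time, not at read time.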

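So how can I tell all of these characters apart, so that I can write them to and read them from a file without them colliding?

Alternatively, could I somehow use a different character set to avoid the problem entirely? I am open to converting each integer into two values between 0 and 255 and using some other text-file encoding setting.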
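One way to realise the "two values between 0 and 255" idea without going through text at all is to pack each integer into two raw bytes. The sketch below does this with `java.nio.ByteBuffer`; the class name `ShortCodec` and its method names are hypothetical, and every input value is assumed to lie in 0..65535.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class ShortCodec {
    // Packs each value (assumed 0..65535) into two bytes, high byte first.
    static byte[] toBytes(int[] values) {
        ByteBuffer buffer = ByteBuffer.allocate(values.length * 2);
        for (int value : values) {
            buffer.putShort((short) value); // keeps the low 16 bits
        }
        return buffer.array();
    }

    // Reverses toBytes; a trailing odd byte, if any, is ignored.
    static int[] toInts(byte[] bytes) {
        ByteBuffer buffer = ByteBuffer.wrap(bytes);
        int[] values = new int[bytes.length / 2];
        for (int i = 0; i < values.length; i++) {
            values[i] = buffer.getShort() & 0xFFFF; // mask to undo sign extension
        }
        return values;
    }

    public static void main(String[] args) {
        int[] original = {2404, 44698, 55597, 17382, 35641, 10988};
        int[] decoded = toInts(toBytes(original));
        System.out.println(Arrays.equals(original, decoded));
    }
}
```

Since no charset is involved, values in the surrogate range are treated like any other number.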

Other possibly relevant information:

I am using the Eclipse IDE, and my workspace's text file encoding is set to UTF-8 (which I understand to be equivalent to Unicode).

Answer 1

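OK, OP here:

I have a working solution, but it is not optimal, so I would still appreciate any other solutions people have.

Instead of trying to write Unicode directly to the file, I convert each integer into two bytes with the following code: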

//Get bytes from the data
private static byte[] decombineData(int[] data) {
    byte[] dataBytes = new byte[data.length*2]; // Create an array of bytes to write to, two per integer
    //Write two bytes for each integer in the data
    for(int i = 0; i < data.length; i++) {
        byte[] twoBytes = decombineBytes(data[i]); // Decombine the integer into two bytes
        dataBytes[i*2] = twoBytes[0];
        dataBytes[i*2+1] = twoBytes[1];
    }
    return dataBytes; // Return the array of bytes
}

//Take a value from 0 to 65535 inclusive and return two bytes that would combine to make that value
private static byte[] decombineBytes(int combinedValue) {
    return new byte[] {(byte)(combinedValue/256-128), (byte)(combinedValue%256-128)};
}

Calling decombineData on the integer array produces a byte array. Here are the methods that apply it in reverse:

// Turn an array of packed bytes back into an array of integers
private static int[] combineData(byte[] bytes) {
    int[] integers = new int[(bytes.length+1)/2]; // One integer per pair of bytes
    for(int i = 0; i < bytes.length; i+=2) { // Step through the bytes two at a time
        if (i+1 != bytes.length) { integers[i/2] = combineBytes(bytes[i],bytes[i+1]); } // Combine two bytes when a full pair exists
        else { integers[i/2] = (combineBytes(bytes[i],(byte)0)); } // Otherwise combine the trailing byte with a zero byte
    }
    return integers;
}

//Combine two bytes into a single integer from 0 to 65535 inclusive
private static int combineBytes(byte byteOne, byte byteTwo) {
    return (byteOne*256)+byteTwo+32896;
}
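The constant 32896 is 128*256 + 128: it undoes the two -128 offsets applied when splitting, one for the high byte and one for the low byte. As a sanity check, the sketch below re-declares the same arithmetic under hypothetical names (`OffsetCheck`, `split`, `join`) and verifies it over the whole 0..65535 range.

```java
public class OffsetCheck {
    // Same arithmetic as decombineBytes: shift each half into the signed byte range.
    static byte[] split(int value) {
        return new byte[] {(byte) (value / 256 - 128), (byte) (value % 256 - 128)};
    }

    // Same arithmetic as combineBytes: 32896 = 128*256 + 128 restores both offsets.
    static int join(byte high, byte low) {
        return high * 256 + low + 32896;
    }

    // Returns true if every value in 0..65535 survives a split/join round trip.
    static boolean allRoundTrip() {
        for (int value = 0; value <= 65535; value++) {
            byte[] pair = split(value);
            if (join(pair[0], pair[1]) != value) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] pair = split(55597);
        // 55597 / 256 = 217 and 55597 % 256 = 45, so the bytes are 89 and -83.
        System.out.println(pair[0] + ", " + pair[1]);
        System.out.println(allRoundTrip());
    }
}
```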

To write to the file, you can use the same code as before, but pass the bytes directly instead of calling .getBytes():

private static void writeBytesToFile(byte[] bytes, String path) {
    try { Files.write(Path.of(path), bytes);
    } catch(IOException e) { e.printStackTrace(); }
}

And a similar method for reading:

public static byte[] readBytesFromFile(String path) {
    try { return Files.readAllBytes(Path.of(path));
    } catch (IOException e) { e.printStackTrace(); System.exit(1); }
    return null;
}
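As a possible alternative to packing the bytes by hand, the standard `java.io.DataOutputStream` and `DataInputStream` classes can write and read 16-bit values directly. This is only a sketch under the same assumption that every value is in 0..65535; the class name `StreamRoundTrip` and its methods are hypothetical, and it writes to a temporary file rather than a real path.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class StreamRoundTrip {
    // Writes each value as two bytes (the low 16 bits, high byte first).
    static void write(int[] values, Path path) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(path)))) {
            for (int value : values) out.writeShort(value);
        }
    }

    // Reads the file back as unsigned 16-bit values.
    static int[] read(Path path) throws IOException {
        int[] values = new int[(int) (Files.size(path) / 2)];
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(Files.newInputStream(path)))) {
            for (int i = 0; i < values.length; i++) values[i] = in.readUnsignedShort();
        }
        return values;
    }

    // Writes to a temporary file, reads it back, and deletes the file.
    static int[] roundTrip(int[] values) {
        try {
            Path temp = Files.createTempFile("ints", ".bin");
            try {
                write(values, temp);
                return read(temp);
            } finally {
                Files.deleteIfExists(temp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int[] original = {2404, 44698, 55597, 17382, 35641, 10988};
        System.out.println(Arrays.equals(original, roundTrip(original)));
    }
}
```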

1 answer

Answer 1
0 2021-01-21 16:40:46