Reading/writing undefined Unicode characters to a file

Question

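Short version:

Writing a string that contains characters with values between 55296 and 57343 (inclusive) to a file causes those characters to be replaced with question marks ('?'), which means the resulting file cannot be read back into the original string.

How can I avoid or work around this?

Long version:

I have an array of integers between 0 and 65535 (inclusive) that I am trying to write to a file so that the array can be retrieved later. An example array: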

int[] integerArray = new int[] {2404,44698,55597,17382,35641,10988};

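Understanding that Unicode has exactly 65536 characters and that Java's char type also has 65536 values, I decided to do this by casting each of these integers to a char, appending them to a string, and then writing that string to a file, like so: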

private static void writeUnicodeToFile(int[] array, String path) {
    String unicode = "";
    for(int integer : array) unicode += (char) integer;
    try { Files.write(Path.of(path), unicode.getBytes());
    } catch(IOException e) { e.printStackTrace(); }
}

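This works well most of the time, but with the example array the value 55597 comes back as a question mark ('?') character, so when I try to retrieve the values like this: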

private static int[] getUnicodeFromFile(String path) {
    String unicode = "";
    try { unicode = Files.readString(Path.of(path));
    } catch(IOException e) { e.printStackTrace(); }
    int[] integerArray = new int[unicode.length()];
    for(int i = 0; i < unicode.length(); i++) integerArray[i] = (int) unicode.charAt(i);
    return integerArray;
}

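the returned array contains 63 (the Unicode value of the question mark) at index 2 instead of 55597.

Looking up the value 55597 in a Unicode character table shows that it is not a valid character, and further experimentation shows that there are 2049 char values that are written to the file as a question mark (including the question mark itself). All other characters are written and read back fine.

However, this means the file cannot be decoded back into the integer array, because a question mark in the file could stand for any of 2048 other values.

So how can I tell all these characters apart, so that I can read and write them to a file without them interfering with each other?

Alternatively, could I somehow use a different character set to avoid the problem entirely? I have the flexibility to convert each integer into two values between 0 and 255 and to use some other text file encoding.

Other possibly relevant information:

I am using the Eclipse IDE and have set my workspace's text file encoding to UTF-8 (which I understand to be equivalent to Unicode).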
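
For context, the values 55296 to 57343 are U+D800 to U+DFFF, the range Unicode reserves for UTF-16 surrogates. A lone surrogate is not a character on its own, so String.getBytes substitutes the charset's replacement byte, which for UTF-8 is '?'. The following is a minimal sketch of that behaviour, assuming UTF-8 is the charset in use; the class and method names (SurrogateDemo, encode) are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

final class SurrogateDemo {
    // Encode a single 16-bit value as a one-char string in UTF-8,
    // mirroring what String.getBytes does in the question's write method.
    static byte[] encode(int value) {
        return String.valueOf((char) value).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 55597 falls inside the surrogate range, 2404 does not.
        System.out.println(Character.isSurrogate((char) 55597));
        System.out.println(Character.isSurrogate((char) 2404));
        // A lone surrogate cannot be encoded, so a replacement byte is emitted.
        System.out.println(encode(55597)[0] == (byte) '?');
    }
}
```

Here encode(55597) yields the single byte 63, whereas a non-surrogate value such as 2404 is encoded as a three-byte UTF-8 sequence.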

Answer 1

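OK, OP here:

I have a working solution, but it is not optimal, so I would still appreciate any other solutions people have.

Instead of trying to write Unicode directly to the file, I used the following code to convert each integer into two bytes: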

//Get bytes from the data
private static byte[] decombineData(int[] data) {
    byte[] dataBytes = new byte[data.length*2]; // Two output bytes per input value
    //Fill each pair of bytes from the corresponding integer
    for(int i = 0; i < data.length; i++) {
        byte[] twoBytes = decombineBytes(data[i]); // Split the value into two bytes
        dataBytes[i*2] = twoBytes[0];
        dataBytes[i*2+1] = twoBytes[1];
    }
    return dataBytes; // Return the array of bytes
}

//Take a value from 0 to 65535 inclusive and return two bytes that would combine to make that value
private static byte[] decombineBytes(int combinedValue) {
    return new byte[] {(byte)(combinedValue/256-128), (byte)(combinedValue%256-128)};
}

Calling decombineData on the integer array produces an array of bytes. Here is the method that applies it in reverse:

// Turn an array of bytes produced by decombineData back into an array of integers
private static int[] combineData(byte[] bytes) {
    int[] integers = new int[(bytes.length+1)/2]; // Initialise the return array
    for(int i = 0; i < bytes.length; i+=2) { // Each pair of bytes yields one integer, stored at index i/2
        if (i+1 != bytes.length) { integers[i/2] = combineBytes(bytes[i],bytes[i+1]); } // Combine two bytes if both exist
        else { integers[i/2] = (combineBytes(bytes[i],(byte)0)); } // Otherwise pair the trailing byte with a zero byte
    }
    return integers;
}

//Combine two bytes into a single integer from 0 to 65535 inclusive
private static int combineBytes(byte byteOne, byte byteTwo) {
    return (byteOne*256)+byteTwo+32896;
}
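
The same split can also be written with shifts and masks rather than division and a 128 offset. This is a sketch of that alternative, not the code above: the names UnsignedShortCodec, split and join are hypothetical, and the byte layout it produces differs from decombineBytes, so the two are not interchangeable on the same file.

```java
final class UnsignedShortCodec {
    // Split a value in 0..65535 into its high and low bytes.
    static byte[] split(int value) {
        return new byte[] {(byte) (value >>> 8), (byte) value};
    }

    // Rejoin the two bytes. Masking with 0xFF undoes the sign extension
    // that Java applies when a byte is widened to an int.
    static int join(byte high, byte low) {
        return ((high & 0xFF) << 8) | (low & 0xFF);
    }
}
```

Masking each byte with 0xFF is the step that matters here: without it, any byte of 128 or more would be read back as a negative number.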

To write to the file, you can use the same code as before, but pass the bytes directly instead of calling .getBytes():

private static void writeBytesToFile(byte[] bytes, String path) {
    try { Files.write(Path.of(path), bytes);
    } catch(IOException e) { e.printStackTrace(); }
}

And a similar method for reading:

public static byte[] readBytesFromFile(String path) {
    try { return Files.readAllBytes(Path.of(path));
    } catch (IOException e) { e.printStackTrace(); System.exit(1); }
    return null;
}
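
As a possible alternative to hand-rolled byte splitting, java.nio.ByteBuffer can store each value as an unsigned 16-bit char without any charset being involved. The sketch below assumes every value fits in 0 to 65535 and uses ByteBuffer's default big-endian order; CharFileCodec and its method names are illustrative only.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Path;

final class CharFileCodec {
    // Write each value as two raw bytes (big-endian); no text encoding is applied,
    // so surrogate values such as 55597 are stored unchanged.
    static void write(int[] values, Path path) throws IOException {
        ByteBuffer buffer = ByteBuffer.allocate(values.length * 2);
        for (int value : values) {
            buffer.putChar((char) value);
        }
        Files.write(path, buffer.array());
    }

    // Read the file back, turning every two bytes into one value in 0..65535.
    static int[] read(Path path) throws IOException {
        ByteBuffer buffer = ByteBuffer.wrap(Files.readAllBytes(path));
        int[] values = new int[buffer.remaining() / 2];
        for (int i = 0; i < values.length; i++) {
            values[i] = buffer.getChar();
        }
        return values;
    }
}
```

Because char is an unsigned 16-bit type in Java, getChar widens to int without producing negative values, so no offset or masking is needed.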

Solution 1
0 2021-01-21 16:40:46