Get UTF-8 encoding from a US-ASCII encoded string

Question

I have a UTF-8 encoded string "ChÃ¢teau", and it gets converted to US-ASCII as "Ch??teau" (in the underlying library of my app).

Now I want to get the original string "ChÃ¢teau" back from the US-ASCII converted string "Ch??teau", but I am not able to do that using the code below.

StringBuilder masterBuffer = new StringBuilder();
byte[] rawDataBuffer = ...; // Read from InputStream; say here it is "ChÃ¢teau"
String rawString = new String(rawDataBuffer, "UTF-8");
masterBuffer.append(rawString);
// getBytes() with no argument uses the platform's default charset, which is US-ASCII here
onMessageReceived(masterBuffer.toString().getBytes());

My application receives the US-ASCII encoded byte array. On the application side, even if I try to get a UTF-8 string out of it, it is of no use. The conversion attempt still gives "Ch??teau".

String asciiString = "Ch??teau";
String originalString = new String(asciiString.getBytes("UTF-8"), "UTF-8");
System.out.println("originalString: " + originalString);

The value of "originalString" is still "Ch??teau".

Is this the right way to do this?

Thanks,

Answer 1

You can't. You lost information by converting to US-ASCII. You can't get back what was lost.
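To make the information loss concrete, here is a minimal sketch. The class and method names are made up for illustration, and the non-ASCII characters are written as \u escapes so the result does not depend on the source file's encoding. It shows two different strings collapsing to the same US-ASCII output, which is why nothing on the receiving side can tell them apart.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original application.
class LossyAsciiDemo {
    // Encodes text as US-ASCII and decodes it again.
    // String.getBytes(Charset) replaces every unmappable character with '?'.
    static String roundTripThroughAscii(String text) {
        byte[] asciiBytes = text.getBytes(StandardCharsets.US_ASCII);
        return new String(asciiBytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        String first = "Ch\u00C3\u00A2teau";  // "ChÃ¢teau" from the question
        String second = "Ch\u00E9\u00E8teau"; // a different string with two non-ASCII characters

        System.out.println(roundTripThroughAscii(first));  // prints Ch??teau
        System.out.println(roundTripThroughAscii(second)); // prints Ch??teau
    }
}
```

Since both inputs produce identical bytes, there is no function that could map "Ch??teau" back to the right original.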

Answer 2

Your code is receiving a UTF-8 encoded byte array and correctly converting it to a Java String, but it is then converting that string to an ASCII encoded byte array. ASCII does not support the Ã and ¢ characters, which is why they are being converted to ?. Once that conversion has been done, there is no going back. ASCII is a subset of UTF-8, and ? in ASCII is also ? in UTF-8.
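As a small illustration of why re-encoding the damaged string as UTF-8 changes nothing, the sketch below dumps its bytes. The class name and the toHex helper are hypothetical, written only for this example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for inspecting the bytes of the damaged string.
class QuestionMarkBytesDemo {
    // Formats bytes as space-separated uppercase hex pairs.
    static String toHex(byte[] bytes) {
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            if (hex.length() > 0) {
                hex.append(' ');
            }
            hex.append(String.format("%02X", b & 0xFF));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        String damaged = "Ch??teau";
        // '?' is the single byte 0x3F in both charsets, so the two lines match.
        System.out.println(toHex(damaged.getBytes(StandardCharsets.US_ASCII))); // 43 68 3F 3F 74 65 61 75
        System.out.println(toHex(damaged.getBytes(StandardCharsets.UTF_8)));    // 43 68 3F 3F 74 65 61 75
    }
}
```

The question's new String(asciiString.getBytes("UTF-8"), "UTF-8") therefore just rebuilds the same "Ch??teau" it started with.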

The solution is to stop converting to ASCII to begin with. You should convert back to UTF-8 instead:

StringBuilder masterBuffer = new StringBuilder();
byte[] rawDataBuffer = ...; // Read from InputStream
String rawString = new String(rawDataBuffer, "UTF-8");
masterBuffer.append(rawString);
onMessageReceived(masterBuffer.toString().getBytes("UTF-8"));

At least that way, for true ASCII characters, the receiver will never know the difference (since ASCII is a subset of UTF-8), and non-ASCII characters will no longer be lost. The receiver just needs to know to expect UTF-8 and not ASCII. Your code will also be more portable, since you will no longer depend on a platform-specific default charset (not all platforms use ASCII by default).
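The two claims above can be checked with a short sketch. The class and method names are illustrative only. It uses the StandardCharsets constants rather than the "UTF-8" string name, which is a substitution on my part: the Charset overloads do not throw the checked UnsupportedEncodingException.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of the original application.
class Utf8RoundTripDemo {
    // Encodes text as UTF-8 and decodes it again; every character survives.
    static String roundTripThroughUtf8(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.UTF_8);
    }

    // True when the text encodes to identical bytes in US-ASCII and UTF-8.
    static boolean sameBytesInAsciiAndUtf8(String text) {
        return Arrays.equals(
                text.getBytes(StandardCharsets.US_ASCII),
                text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String accented = "Ch\u00C3\u00A2teau"; // "ChÃ¢teau", written with escapes
        System.out.println(roundTripThroughUtf8(accented).equals(accented)); // prints true
        System.out.println(sameBytesInAsciiAndUtf8("teau"));                 // prints true
        System.out.println(sameBytesInAsciiAndUtf8(accented));               // prints false
    }
}
```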

Of course, in your example, your StringBuilder is redundant since you are not adding anything else to it, so you could just remove it:

byte[] rawDataBuffer = ...; // Read from InputStream
String rawString = new String(rawDataBuffer, "UTF-8");
onMessageReceived(rawString.getBytes("UTF-8"));

And then the String becomes redundant, too:

byte[] rawDataBuffer = ...; // Read from InputStream
onMessageReceived(rawDataBuffer);

If onMessageReceived() expects bytes as input, why waste overhead converting bytes to String and back to bytes again?
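A rough sketch of that pass-through, under stated assumptions: the real onMessageReceived() signature is not shown in the question, so the version here is a stand-in that takes a byte[], and readAll is a hypothetical helper that drains an InputStream without decoding anything.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original application.
class PassThroughDemo {
    // Reads the stream to the end and returns the raw bytes unchanged.
    static byte[] readAll(InputStream in) {
        try {
            ByteArrayOutputStream collected = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int count;
            while ((count = in.read(chunk)) != -1) {
                collected.write(chunk, 0, count);
            }
            return collected.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Stand-in for the application's callback (assumed to accept raw bytes).
    static void onMessageReceived(byte[] message) {
        System.out.println(message.length + " bytes received");
    }

    public static void main(String[] args) {
        // Simulated input: UTF-8 bytes arriving on a stream.
        byte[] wire = "Ch\u00E2teau".getBytes(StandardCharsets.UTF_8);
        onMessageReceived(readAll(new ByteArrayInputStream(wire)));
    }
}
```

Because no charset is involved between reading and forwarding, there is no step at which a character could be replaced.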


Answer 1
3 (accepted) 2015-12-02 14:26:30

Answer 2
1 2015-12-03 02:08:39
