Handling UTF-8 encoding

Original post: 2008-11-06 15:53:09. Tags: java, xml, unicode, encoding, utf-8

We have a Java application running on a WebLogic server that picks up XML messages from a JMS or MQ queue and writes them to another JMS queue. The application doesn't modify the XML content in any way. We use BEA's XMLObject to read the messages from the source queue and write them to the destination queue.

The XML messages declare their encoding as UTF-8.

We have an issue when the XML contains characters outside the ASCII range (the £ symbol, for example). When the message is read from the queue we can see that the £ symbol is intact. However, once we write it to the destination queue, the £ symbol is lost and is replaced with Â£ instead.

I have checked the OS-level settings (locale settings) and everything seems to be fine. What else should I check to make sure that this doesn't happen?

3 answers

"once we write it to the destination queue, the £ symbol is lost and is replaced with Â£ instead"

That tells me the character is being written as UTF-8, but it's being read as if it were in a single-byte encoding like ISO-8859-1. (For any character in the range U+00A0..U+00BF, if you encode it as UTF-8 and decode it as ISO-8859-1, you end up with the two-character sequence ÂX, where X is the original character.) I would look at the encoding settings of the receiving JMS queue.
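To see that effect in isolation, here is a minimal sketch that encodes a string as UTF-8 and then deliberately decodes the bytes as ISO-8859-1. The class and method names are made up for illustration, and the pound sign is written as the escape \u00A3 so the result does not depend on the source file's encoding.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of the application in the question.
class MojibakeDemo {
    // Encode text as UTF-8, then (incorrectly) decode those bytes as ISO-8859-1.
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String pound = "\u00A3"; // U+00A3, the pound sign
        String garbled = misdecode(pound);

        // UTF-8 stores U+00A3 as the two bytes C2 A3. Read as ISO-8859-1,
        // each byte becomes its own character: U+00C2 followed by U+00A3.
        System.out.println(garbled.length());               // 2
        System.out.println(garbled.equals("\u00C2\u00A3")); // true
    }
}
```

If the destination shows exactly this two-character pattern, the bytes themselves are probably fine and the problem is in whatever decodes them.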

You should use InputStream, OutputStream, and byte[] to handle XML documents, not Reader, Writer, and String. In the world of JMS, BytesMessage is a better fit for XML payloads than TextMessage.

Every XML document specifies its character encoding internally, and all XML processing APIs are designed to take byte streams and, where necessary, work out the correct character encoding themselves. The text-based APIs are only there… to confuse people, I guess! Anyway, applications should let the XML processor deal with character encoding issues rather than trying to manage them themselves (or using a text-oriented API without a solid understanding of character-encoding issues).
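As a rough illustration of letting the parser work out the encoding, the sketch below feeds raw bytes to the JDK's built-in DOM parser without ever naming a charset on the parsing side. The class name, the helper method, and the <price> element are assumptions made for this example; the question's application uses XMLObject, which is not shown here.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical demo class using only the JDK's XML parser.
class XmlBytesDemo {
    // Parse a document straight from its bytes and return the root element's text.
    // No charset is passed in: the parser reads the encoding declaration itself.
    static String rootText(byte[] xmlBytes) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xmlBytes));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String body = "<price>\u00A3</price>"; // made-up payload containing a pound sign

        // The same document, serialized in two different encodings,
        // each with a declaration that matches its bytes.
        byte[] asUtf8 = ("<?xml version=\"1.0\" encoding=\"UTF-8\"?>" + body)
                .getBytes(StandardCharsets.UTF_8);
        byte[] asLatin1 = ("<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>" + body)
                .getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(rootText(asUtf8).equals("\u00A3"));   // true
        System.out.println(rootText(asLatin1).equals("\u00A3")); // true
    }
}
```

The byte arrays differ, but the parsed text is the same in both cases, which is the argument for moving XML around as bytes (for example in a BytesMessage) rather than as a String.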

Without more specifics, I'd guess that somewhere a method with an optional encoding parameter is being called without it, and is defaulting to ISO-8859-1. In general, check every place where data passes between an InputStream/OutputStream and a Reader/Writer.

For instance, an OutputStreamWriter takes an optional encoding that you could be leaving out.
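A small sketch of that last point, with a hypothetical helper that always passes the charset explicitly. It only shows how the charset argument changes the bytes that get written; it is not a claim about where the question's application goes wrong.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class.
class WriterCharsetDemo {
    // Write text through an OutputStreamWriter with an explicit charset
    // and return the bytes that reached the underlying stream.
    static byte[] write(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // new OutputStreamWriter(out), with no charset argument, would fall
        // back to the JVM's default charset, which varies between environments.
        try (Writer writer = new OutputStreamWriter(out, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] utf8 = write("\u00A3", StandardCharsets.UTF_8);        // bytes C2 A3
        byte[] latin1 = write("\u00A3", StandardCharsets.ISO_8859_1); // byte A3

        System.out.println(utf8.length);   // 2
        System.out.println(latin1.length); // 1
    }
}
```

The same applies on the reading side: InputStreamReader also has a constructor that takes a Charset, and using it on both ends removes the dependency on the default.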