
Sending non-standard characters in XML

I'm debugging a third-party gateway system that translates binary messages into calls to an XML web service. When it receives messages containing the special characters 0x80, 0x81, 0x82 and 0x83, they are not sent as XML correctly.

I've narrowed the problem down to the point where they convert a byte[] to a String, and produced a simple example of what's going wrong. The special values all get translated to the same "unknown" character.

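import java.util.Arrays;

public class Test {
    public static void main(String[] args) {
        test(0x80); test(0x81); test(0x82); test(0x83);
    }

    public static void test(int value) {
        // No charset given, so the platform default encoding is used for decoding and encoding
        String message = new String(new byte[]{(byte) value});
        System.out.println(value + " => " + message + " => " + Arrays.toString(message.getBytes()));
    }
}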

Output

128 => � => [-17, -65, -67]
129 => � => [-17, -65, -67]
130 => � => [-17, -65, -67]
131 => � => [-17, -65, -67]

I'm wondering how this should be fixed. I've tried changing their code to use an explicit character set:

new String(bytes, Charset.forName("UTF-8"))

However, this results in the same problem. The values 0x80-0x83 don't seem to exist as valid XML entities.

I've found that you can use the character constructor, which kind of works, but it produces the following output, and I'm not sure it is correct:

new String(new char[]{(char) value}, 0, 1); 

Output

128 => weird box character 0080 => [-62, -128]
129 => weird box character 0081 => [-62, -127]
130 => weird box character 0082 => [-62, -126]
131 => weird box character 0083 => [-62, -125]

You cannot translate the bytes one at a time into a Java String. You have to take the encoding of the binary data into account. For example, UTF-8 uses a variable number of bytes per character.

See UTF-8 & Unicode, what's with 0xC0 and 0x80?
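To illustrate the point about variable byte lengths, here is a minimal sketch. The class name Utf8LengthDemo and the helper utf8Length are made up for this example; only String.getBytes and StandardCharsets come from the JDK.

```java
import java.nio.charset.StandardCharsets;

class Utf8LengthDemo {
    // Number of bytes the given text occupies when encoded as UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("A"));       // U+0041 encodes to 1 byte
        System.out.println(utf8Length("\u00e9"));  // U+00E9 (e with acute) encodes to 2 bytes
        System.out.println(utf8Length("\u20ac"));  // U+20AC (Euro sign) encodes to 3 bytes
    }
}
```

Because one character can span several bytes, a single byte taken out of context often cannot be decoded on its own.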

You cannot transfer binary data directly inside an XML document; there is no valid way to represent an ASCII NUL (0x00), for instance.

You need to encode it as an ASCII string (Base64 or similar), transfer that, and then decode it at the receiving end.
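As a rough sketch of that approach, assuming Java 8 or later (for java.util.Base64) and assuming the receiving side can be changed to decode the payload. The class name, the helper names and the <payload> element are hypothetical and are not part of the gateway in the question.

```java
import java.util.Arrays;
import java.util.Base64;

class Base64PayloadDemo {
    // Turn arbitrary bytes into plain ASCII text that is safe inside XML.
    static String encode(byte[] raw) {
        return Base64.getEncoder().encodeToString(raw);
    }

    // Recover the original bytes on the receiving side.
    static byte[] decode(String text) {
        return Base64.getDecoder().decode(text);
    }

    public static void main(String[] args) {
        byte[] raw = {(byte) 0x80, (byte) 0x81, (byte) 0x82, (byte) 0x83, 0x00};
        String encoded = encode(raw);
        // Hypothetical element name, for illustration only.
        System.out.println("<payload>" + encoded + "</payload>");
        System.out.println(Arrays.equals(raw, decode(encoded)));
    }
}
```

This only helps if the bytes really are opaque binary data. If they are meant to be text, the charset question discussed in the other answers still has to be settled.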

First, using

String message = new String(new byte[]{(byte)value});

is almost always wrong. To convert byte[] to String you must decide which character encoding to use. The code above will (unfortunately) convert using the JVM default encoding, which depends on various OS settings (and may change at any time if the user changes these settings). In almost all cases you want to specify the encoding explicitly.

Now to your problem:

I'm wondering how this should be fixed. I've tried changing their code to use an explicit character set

new String(bytes, Charset.forName("UTF-8"))

However this results in same problem.

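This is expected. You told Java to interpret the single-byte sequence 0x80 as UTF-8. However, that is not a valid UTF-8 sequence: in UTF-8, bytes 0x80 to 0xBF are continuation bytes and can only appear after a lead byte. Therefore Java substitutes the Unicode replacement character (U+FFFD) to indicate the error.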
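If you would rather get an exception than a silent replacement character, a CharsetDecoder can be set to report malformed input. The following is only a sketch; the class name StrictUtf8Demo and the helper decodeStrict are invented for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictUtf8Demo {
    // Decodes bytes as UTF-8 and throws instead of substituting U+FFFD.
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    public static void main(String[] args) {
        byte[] bytes = {(byte) 0x80};

        // The String constructor is lenient: the invalid byte becomes U+FFFD.
        String lenient = new String(bytes, StandardCharsets.UTF_8);
        System.out.println(Integer.toHexString(lenient.charAt(0)));

        // The strict decoder rejects the same input.
        try {
            decodeStrict(bytes);
        } catch (CharacterCodingException e) {
            System.out.println("not valid UTF-8: " + e);
        }
    }
}
```

Failing loudly like this makes it easier to spot input that is not in the encoding you assumed.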

To solve this problem, you must find out what 0x80 etc. mean in the data you receive. Find out which character encoding the data uses, and use that encoding to convert to String.

As a guess: the data might use the Windows encoding CP 1252 (often confused with ISO 8859-1). In CP 1252, 0x80 is the Euro sign, 0x82 and 0x83 are "‚" and "ƒ", and 0x81 is unassigned.
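To check that guess against real data, you could decode the same bytes with both candidate encodings and compare the resulting code points. This sketch assumes the JRE provides the "windows-1252" charset, which common JDKs do but the Java specification does not guarantee. The class and method names are made up.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class LegacyCharsetDemo {
    // Decode a single byte using the given charset.
    static String decode(byte value, Charset charset) {
        return new String(new byte[]{value}, charset);
    }

    public static void main(String[] args) {
        // Assumption: this charset is available in the running JRE.
        Charset cp1252 = Charset.forName("windows-1252");
        for (int value : new int[]{0x80, 0x82, 0x83}) {
            String asCp1252 = decode((byte) value, cp1252);
            String asLatin1 = decode((byte) value, StandardCharsets.ISO_8859_1);
            System.out.printf("0x%02X => CP1252 U+%04X, ISO-8859-1 U+%04X%n",
                    value, (int) asCp1252.charAt(0), (int) asLatin1.charAt(0));
        }
    }
}
```

Under ISO 8859-1 each byte maps to the code point with the same number, which is the same result as the (char) cast in the question: the C1 control characters U+0080 to U+0083, not printable symbols. Under CP 1252 the byte 0x80 becomes U+20AC, the Euro sign.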
