
How to decode the Unicode encoding in Java?

Our site has a search feature: we build the query, send it in a request, and the vendor returns the response as JSON. The vendor crawls our site, captures its data, and serves that data back in the response. In our design we convert the JSON into Java objects using GSON. The page declares UTF-8 as its charset in the meta tag.

Depending on the request, the response sometimes contains Unicode encoding for special characters, and the browser renders it in a strange way. How should I decode this Unicode encoding?

For example, the special character 'ndash' (en dash) appears in the response encoded as '\u2013'.

To clarify the difference between Unicode and a character encoding:

Unicode

  • is an abstract standard that aims to assign a unique code point to every character (currently more than 110 000).

Character encoding

  • defines how a character is represented by a sequence of bytes
  • one such encoding is UTF-8, which uses 1 to 4 bytes to represent a Unicode code point
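The 1 to 4 byte range can be checked with a small sketch. The class and method names here (Utf8Lengths, utf8Length) are made up for illustration, and the sample characters are written as Unicode escapes so the result does not depend on how the source file itself is saved.

```java
import java.nio.charset.StandardCharsets;

class Utf8Lengths {
    // Number of bytes UTF-8 needs to encode the given text.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // Latin letter A, e with acute accent, en dash, and an emoji
        // (the emoji is outside the BMP, so it is a surrogate pair in the String).
        String[] samples = { "A", "\u00E9", "\u2013", "\uD83D\uDE00" };
        for (String s : samples) {
            String codePoint = Integer.toHexString(s.codePointAt(0)).toUpperCase();
            System.out.println("U+" + codePoint + " -> " + utf8Length(s) + " byte(s)");
        }
    }
}
```

The four samples need 1, 2, 3 and 4 bytes respectively; the en dash from the question is one of the 3-byte cases.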

A Java String is always UTF-16 internally. Hence, when you construct a String from bytes, you can use the following String constructor:

new String(byte[], encoding)

The second argument should be the encoding the characters are in when the client sends them. If you don't explicitly define an encoding, you will get the default system encoding, which you can examine using Charset.defaultCharset().
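As a minimal sketch of that constructor, assuming the incoming bytes really are UTF-8: the class name DecodeBytes and the helper decodeUtf8 are hypothetical, and the byte array stands in for whatever the client or vendor actually sent.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DecodeBytes {
    // Decode bytes that are known (by assumption) to be UTF-8.
    static String decodeUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The three bytes UTF-8 uses for the en dash, U+2013.
        byte[] raw = { (byte) 0xE2, (byte) 0x80, (byte) 0x93 };

        String correct = decodeUtf8(raw);
        // Decoding the same bytes with the wrong charset yields one char per byte.
        String wrong = new String(raw, StandardCharsets.ISO_8859_1);

        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("as UTF-8:      " + correct.length() + " char");
        System.out.println("as ISO-8859-1: " + wrong.length() + " chars");
    }
}
```

Passing a Charset constant rather than a charset name also avoids the checked UnsupportedEncodingException that the String-name overload declares.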

You can manually set the default encoding with an argument when starting the JVM:

-Dfile.encoding="utf-8"

Although rarely needed, you can also use CharsetDecoder / CharsetEncoder directly.
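One case where CharsetDecoder can be worth the extra code is strict validation: new String(byte[], charset) silently substitutes a replacement character for malformed input, while a decoder can be told to report it. The sketch below assumes UTF-8 input; the names StrictDecode and decodeStrict are illustrative only.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictDecode {
    // Decode UTF-8 and throw instead of substituting a replacement character.
    static String decodeStrict(byte[] raw) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(raw)).toString();
    }

    public static void main(String[] args) {
        byte[] valid = { (byte) 0xE2, (byte) 0x80, (byte) 0x93 }; // en dash
        byte[] truncated = { (byte) 0xE2, (byte) 0x80 };          // last byte missing
        try {
            System.out.println("valid input decoded to " + decodeStrict(valid).length() + " char");
            decodeStrict(truncated); // expected to throw
        } catch (CharacterCodingException e) {
            System.out.println("rejected malformed input: " + e);
        }
    }
}
```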

