
JSON character encoding in JavaScript differs from Java

The Java code below

    JSONObject obj = new JSONObject();
    try {
        obj.put("alert", "•é");
        // Encode the serialized JSON text as UTF-8 bytes.
        byte[] test = obj.toString().getBytes("UTF-8");
        // java.util.Arrays.toString prints the byte values, not the array reference.
        logger.info("bytes are " + Arrays.toString(test));
    } catch (JSONException e) {
        e.printStackTrace();
    } catch (UnsupportedEncodingException e) {
        e.printStackTrace();
    }

produces JSON text that escapes the bullet character but not the Latin letter e with acute, e.g. "\u2022é". The resulting bytes are [123, 34, 97, 108, 101, 114, 116, 34, 58, 34, 92, 117, 50, 48, 50, 50, -61, -87, 34, 125]

How can I get exactly the same output in JavaScript (in terms of byte sequence)? I don't understand why JSONObject escapes only one of the characters and not the other. I don't know what rule it follows.

It seems that in JavaScript I can only either escape everything outside ASCII (e.g. \u0080-\uffff) or not escape at all.

Thanks!

There are two different things happening: Unicode encoding and JSON string escaping.

Per section 2.5 (Strings) of the JSON RFC:

.. All Unicode characters may be placed within the quotation marks except for the characters that must be escaped ..

Any character may be escaped. If the character is in the Basic Multilingual Plane (U+0000 through U+FFFF), then it may be represented as a six-character sequence .. [and characters outside the BMP are escaped as UTF-16 encoded surrogate pairs]

That is, the JSON strings "•é" and "\u2022é" are equivalent. It is entirely up to the serialization implementation which (additional) characters to escape, and both forms are valid.

It is this JSON text (which is Unicode text) that is then encoded when it is converted to a byte stream. In the example it is encoded as UTF-8. Two JSON strings may therefore be equivalent without being byte-equivalent at the stream level or character-equivalent at the JSON text level.
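As a small illustration of that distinction, the sketch below parses both forms and compares their UTF-8 bytes. The variable names are made up for this example, and it assumes a runtime where `TextEncoder` is available as a global (current browsers and recent Node.js versions).

```javascript
// Two JSON texts for the same string value.
const literalText = '"\u2022\u00e9"';   // contains the characters • and é directly
const escapedText = '"\\u2022\u00e9"';  // • written as the six characters \u2022

// Both texts parse to the same two-character string.
const sameValue = JSON.parse(literalText) === JSON.parse(escapedText);

// TextEncoder always produces UTF-8.
const encoder = new TextEncoder();
const literalBytes = Array.from(encoder.encode(literalText));
const escapedBytes = Array.from(encoder.encode(escapedText));

console.log(sameValue);     // true
console.log(literalBytes);  // [34, 226, 128, 162, 195, 169, 34]
console.log(escapedBytes);  // [34, 92, 117, 50, 48, 50, 50, 195, 169, 34]
```

The parsed values are equal even though the byte sequences differ in both length and content.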


As for the rule JSONObject follows, it escapes a character c when

    c < ' '
    || (c >= '\u0080' && c < '\u00a0')
    || (c >= '\u2000' && c < '\u2100')
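To get the same bytes in JavaScript, one option is to apply the same ranges on top of `JSON.stringify`. The following is only a sketch: `stringifyLikeJSONObject` is a hypothetical helper name, it assumes `TextEncoder` is available, and it covers only the three ranges quoted above, not key ordering or any other escaping the Java library may perform.

```javascript
// Hypothetical helper: JSON.stringify plus the extra escape ranges quoted above.
function stringifyLikeJSONObject(value) {
  // JSON.stringify already escapes control characters below U+0020,
  // double quotes and backslashes, so only the two higher ranges remain.
  return JSON.stringify(value).replace(/[\u0080-\u009f\u2000-\u20ff]/g, (ch) =>
    "\\u" + ch.charCodeAt(0).toString(16).padStart(4, "0")
  );
}

const json = stringifyLikeJSONObject({ alert: "\u2022\u00e9" }); // "•é"

// Encode as UTF-8, then map to signed values so the list can be
// compared directly with Java's byte[] output.
const unsignedBytes = Array.from(new TextEncoder().encode(json));
const signedBytes = unsignedBytes.map((b) => (b > 127 ? b - 256 : b));

console.log(json);        // {"alert":"\u2022é"}
console.log(signedBytes); // [123, 34, 97, 108, 101, 114, 116, 34, 58, 34, 92, 117, 50, 48, 50, 50, -61, -87, 34, 125]
```

Doing the replacement on the already serialized text is safe here because the escape sequences `JSON.stringify` emits are pure ASCII, so the regular expression can only match characters that were left unescaped.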

One reason these characters, in the range [\u2000, \u2100), may be escaped is to ensure the resulting JSON is also valid JavaScript. The article "JSON: The JavaScript subset that isn't" discusses the issue: the Unicode code points \u2028 and \u2029 are treated as line terminators in JavaScript string literals, but not in JSON. (There are other Unicode separator characters in the range, so the implementation might as well catch them in one go.)
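A short sketch of what this means on the JavaScript side: `JSON.stringify` leaves U+2028 and U+2029 as raw characters, so they have to be escaped by hand if the output is going to be pasted into script source. The variable names are illustrative only. Note that engines implementing ES2019 or later accept these characters inside string literals, so the concern mainly applies to older engines.

```javascript
const separators = "\u2028\u2029"; // LINE SEPARATOR, PARAGRAPH SEPARATOR

// JSON.stringify emits both characters unescaped: quote, U+2028, U+2029, quote.
const rawText = JSON.stringify(separators);
console.log(rawText.length); // 4

// Escape them explicitly so the text is also safe as JavaScript source.
const safeText = rawText.replace(/[\u2028\u2029]/g, (ch) =>
  "\\u" + ch.charCodeAt(0).toString(16)
);
console.log(safeText); // "\u2028\u2029"

// Both texts still parse to the same string value.
console.log(JSON.parse(rawText) === JSON.parse(safeText)); // true
```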
