Java: convert a String from UTF-8 to UTF-16

I am trying to convert a String (say, String a = "try") to UTF-16. This is what I did:

        try {
            String ulany = new String("357810087745445");
            System.out.println(ulany.getBytes().length);
            String string = new String(ulany.getBytes(), "UTF-16");
            System.out.println(string.getBytes().length);
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        }

ulany.getBytes().length prints 15 and string.getBytes().length prints 24, but I think it should be 30. What did I do wrong?

A String (and a char) always holds Unicode, so the string itself needs no conversion.

However, if you want bytes, that is, binary data in some encoding such as UTF-16, you need a conversion:

ulany.getBytes("UTF-16")   // UTF-16 big endian, preceded by a byte order mark (FE FF)
ulany.getBytes("UTF-16LE") // UTF-16 little endian, no byte order mark
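
To make the byte counts concrete, here is a minimal sketch that encodes the question's string with several charsets. The class name `Utf16Lengths` and the helper `encodedLength` are illustrative names only, not part of the original answer. Note that Java's plain "UTF-16" encoder writes a 2-byte byte order mark first, so it yields 32 bytes rather than the expected 30:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Utf16Lengths {
    // Hypothetical helper: number of bytes the text occupies in the given charset.
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String ulany = "357810087745445"; // 15 ASCII digits

        System.out.println(encodedLength(ulany, StandardCharsets.UTF_8));    // 15
        System.out.println(encodedLength(ulany, StandardCharsets.UTF_16BE)); // 30
        System.out.println(encodedLength(ulany, StandardCharsets.UTF_16LE)); // 30
        System.out.println(encodedLength(ulany, StandardCharsets.UTF_16));   // 32: BOM (FE FF) + 30
    }
}
```

So if exactly 30 bytes are wanted, "UTF-16BE" or "UTF-16LE" is probably the charset to ask for.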

However, System.out uses the operating system's encoding, so one cannot simply pick a different encoding for console output.

In fact, a char is a UTF-16 code unit.
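
A small sketch of what "char is UTF-16" means in practice. The class name `CharIsUtf16` and the helper `describe` are made up for this illustration. A character outside the Basic Multilingual Plane, such as U+1F600, takes two chars (a surrogate pair) but is a single code point:

```java
class CharIsUtf16 {
    // Hypothetical helper: reports UTF-16 code units versus Unicode code points.
    static String describe(String s) {
        return "chars=" + s.length() + ", codePoints=" + s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // Each ASCII digit fits in one 16-bit char.
        System.out.println(describe("357810087745445")); // chars=15, codePoints=15

        // U+1F600 needs a surrogate pair, i.e. two chars.
        String emoji = new String(Character.toChars(0x1F600));
        System.out.println(describe(emoji));                            // chars=2, codePoints=1
        System.out.println(Character.isHighSurrogate(emoji.charAt(0))); // true
    }
}
```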


What happens

        //String ulany = new String("357810087745445");
        String ulany = "357810087745445";

The String copy constructor dates back to Java's C++-influenced beginnings and serves no purpose here; a plain literal is enough.

        System.out.println(ulany.getBytes().length);

This gives different results on different platforms, as getBytes() uses the default Charset. Better:

        System.out.println(ulany.getBytes("UTF-8").length);
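
As a variation on the line above, a sketch using the `StandardCharsets` constants (available since Java 7) instead of a charset name. This avoids the checked UnsupportedEncodingException altogether. The class name `ExplicitCharset` and the helper `utf8Length` are illustrative only:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ExplicitCharset {
    // Same result on every platform, and no UnsupportedEncodingException to catch.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String ulany = "357810087745445";
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println(ulany.getBytes().length); // depends on the default charset
        System.out.println(utf8Length(ulany));       // 15
    }
}
```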

        String string = new String(ulany.getBytes(), "UTF-16");

This interprets those bytes pairwise, as UTF-16 code units; having 15 bytes, an odd number, is already wrong. Assuming the default charset encodes each digit as one byte, one gets 8 unusual characters: 7 from the byte pairs, whose high byte is not zero, plus a replacement character for the leftover 15th byte.

        System.out.println(string.getBytes().length);

Getting 24 now means an average of 3 bytes per char (8 chars × 3 bytes). Hence the default platform encoding is probably UTF-8, which encodes these characters as multibyte sequences.

The string will contain something like: 该字符串将包含以下内容:

        String string = "\u3335\u3738\u3130\u3038\u3737\u3435\u3434\uFFFD";
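
The following sketch reproduces the effect with explicit charsets so that it does not depend on the platform default. The class name `MisdecodeDemo` and the helper `misdecode` are hypothetical names. It relies on two documented behaviours: Java's "UTF-16" decoder assumes big endian when there is no byte order mark, and the String constructor replaces malformed input (here the unpaired last byte) with the replacement character U+FFFD:

```java
import java.nio.charset.StandardCharsets;

class MisdecodeDemo {
    // Reinterprets the UTF-8 bytes of the text as UTF-16, as the question's code
    // does on a platform whose default charset is UTF-8.
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        String garbled = misdecode("357810087745445");
        System.out.println(garbled.length()); // 8: seven byte pairs + one replacement char

        // Print each char as a hex code point.
        for (char c : garbled.toCharArray()) {
            System.out.printf("U+%04X ", (int) c);
        }
        System.out.println();

        // Each of the 8 chars takes 3 bytes in UTF-8.
        System.out.println(garbled.getBytes(StandardCharsets.UTF_8).length); // 24
    }
}
```

This accounts for the 24 in the question: 8 chars at 3 bytes each.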

You can also pass a text encoding to getBytes(), which removes the dependence on the platform default (although the bytes are still reinterpreted as UTF-16 afterwards). For example:

String string = new String(ulany.getBytes("UTF-8"), "UTF-16");
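
For comparison, a sketch of a conversion that keeps the text intact: decode with the charset the bytes were written in, then encode to the target charset. The class name `RoundTrip` and both helper names are made up for this example, and it assumes the incoming bytes really are UTF-8:

```java
import java.nio.charset.StandardCharsets;

class RoundTrip {
    // Encode to UTF-16 bytes, then decode with the SAME charset: the text survives.
    static String roundTrip(String text) {
        byte[] utf16 = text.getBytes(StandardCharsets.UTF_16);
        return new String(utf16, StandardCharsets.UTF_16);
    }

    // Convert UTF-8 bytes (e.g. read from a file) to UTF-16BE bytes:
    // decode as UTF-8 first, then re-encode.
    static byte[] utf8ToUtf16be(byte[] utf8) {
        String text = new String(utf8, StandardCharsets.UTF_8);
        return text.getBytes(StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        String ulany = "357810087745445";
        System.out.println(roundTrip(ulany).equals(ulany)); // true

        byte[] utf16be = utf8ToUtf16be(ulany.getBytes(StandardCharsets.UTF_8));
        System.out.println(utf16be.length); // 30
    }
}
```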
