How do I encode/decode UTF-16LE byte arrays with a BOM?

Question

I need to encode/decode UTF-16 byte arrays to and from java.lang.String. The byte arrays are given to me with a Byte Order Mark (BOM), and I need to produce encoded byte arrays with a BOM.

Also, because I'm dealing with a Microsoft client/server, I'd like to emit the encoding in little endian (along with the LE BOM) to avoid any misunderstandings. I do realize that with the BOM it should work in big endian, but I don't want to swim upstream in the Windows world.

As an example, here is a method which encodes a java.lang.String as UTF-16 in little endian with a BOM:

public static byte[] encodeString(String message) {

    byte[] tmp = null;
    try {
        tmp = message.getBytes("UTF-16LE");
    } catch(UnsupportedEncodingException e) {
        // should not be possible: every Java platform is required to support UTF-16LE
        AssertionError ae =
            new AssertionError("Could not encode UTF-16LE");
        ae.initCause(e);
        throw ae;
    }

    // use brute force method to add BOM
    byte[] utf16lemessage = new byte[2 + tmp.length];
    utf16lemessage[0] = (byte)0xFF;
    utf16lemessage[1] = (byte)0xFE;
    System.arraycopy(tmp, 0,
                     utf16lemessage, 2,
                     tmp.length);
    return utf16lemessage;
}

What is the best way to do this in Java? Ideally I'd like to avoid copying the entire byte array into a new byte array that has two extra bytes allocated at the beginning.

The same goes for decoding such a string, but that's much more straightforward using the java.lang.String constructor:

public String(byte[] bytes,
              int offset,
              int length,
              String charsetName)
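
For the decoding side, a minimal sketch of how the offset/length constructor can skip a leading little-endian BOM without copying the array first. The class and method names (Utf16LeOffsetDecode, decodeSkippingBom) are hypothetical, and the sketch uses the Charset overload of the constructor with StandardCharsets (Java 7+) instead of the charset-name overload shown above, so no checked exception is involved:

```java
import java.nio.charset.StandardCharsets;

final class Utf16LeOffsetDecode {
    // Hypothetical helper: decodes UTF-16LE bytes, skipping a leading FF FE BOM if present.
    static String decodeSkippingBom(byte[] bytes) {
        boolean hasBom = bytes.length >= 2
                && bytes[0] == (byte) 0xFF
                && bytes[1] == (byte) 0xFE;
        int offset = hasBom ? 2 : 0;
        // The offset/length constructor reads directly from the original array.
        return new String(bytes, offset, bytes.length - offset, StandardCharsets.UTF_16LE);
    }
}
```

Input without a BOM is decoded as-is, which is an assumption about what the caller wants rather than something the question specifies.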

Answer 1

The "UTF-16" charset name will always encode with a BOM (in big-endian order) and will decode data in either byte order by detecting the BOM, while "UnicodeBig" and "UnicodeLittle" are useful for encoding in a specific byte order. Use UTF-16LE or UTF-16BE for no BOM - see this post for how to use "\uFEFF" to handle BOMs manually. See here for canonical naming of charset string names or (preferably) the Charset class. Also take note that only a limited subset of encodings is absolutely required to be supported.
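
One way to handle the BOM manually with "\uFEFF", sketched under the assumption that an extra String concatenation is acceptable: the character U+FEFF encoded as UTF-16LE is exactly the bytes FF FE, so prepending it lets the BOM-less encoder produce the little-endian BOM itself. The class and method names here are hypothetical:

```java
import java.nio.charset.StandardCharsets;

final class Utf16LeBom {
    // Hypothetical helper: encodes as UTF-16LE with a leading FF FE BOM.
    static byte[] encodeWithBom(String message) {
        // U+FEFF becomes the bytes FF FE when encoded as UTF-16LE.
        return ("\uFEFF" + message).getBytes(StandardCharsets.UTF_16LE);
    }
}
```

This trades the manual byte-array copy for a String concatenation, so it is shorter than the code in the question but not necessarily cheaper.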

Answer 2

This is how you do it in nio:

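    // Prepend U+FEFF so the encoder itself emits the little-endian BOM (FF FE).
    ByteBuffer buffer = Charset.forName("UTF-16LE").encode("\uFEFF" + message);
    // Copy only the encoded bytes; the backing array may be larger than the content.
    byte[] bytes = new byte[buffer.remaining()];
    buffer.get(bytes);
    return bytes;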

It is certainly supposed to be faster. I don't know how many arrays it makes under the covers, but my understanding of the point of the API is that it is supposed to minimize that.

Answer 3

First off, for decoding you can use the character set "UTF-16"; that automatically detects an initial BOM. For encoding UTF-16BE, you can also use the "UTF-16" character set - that'll write a proper BOM and then output big-endian data.
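
A small sketch of that behavior, assuming Java 7+ for StandardCharsets (the class and method names are hypothetical). Decoding with "UTF-16" consumes the BOM and picks the byte order from it, while encoding with "UTF-16" writes a big-endian BOM followed by big-endian data:

```java
import java.nio.charset.StandardCharsets;

final class Utf16Detect {
    // Hypothetical helper: the UTF-16 decoder reads the BOM to choose the byte order
    // and does not include it in the resulting String.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16);
    }

    // Hypothetical helper: the UTF-16 encoder always writes FE FF, then big-endian data.
    static byte[] encodeBigEndianWithBom(String message) {
        return message.getBytes(StandardCharsets.UTF_16);
    }
}
```

If the input has no BOM at all, the "UTF-16" decoder falls back to big endian, so BOM-less little-endian data should be decoded with "UTF-16LE" instead.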

For encoding to little endian with a BOM, I don't think your current code is too bad, even with the double allocation (unless your strings are truly monstrous). If they are, you might want to work with a java.nio ByteBuffer rather than a byte array, and use the java.nio.charset.CharsetEncoder class (which you can get from Charset.forName("UTF-16LE").newEncoder()).
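
A rough sketch of the CharsetEncoder approach, not a definitive implementation: the output buffer is sized once for the BOM plus two bytes per char, the BOM is written first, and the encoder then writes straight into the same buffer. The class and method names are hypothetical, and the choice to throw IllegalArgumentException on malformed input (such as an unpaired surrogate) is an assumption made for this example:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

final class Utf16LeBomEncoder {
    // Hypothetical helper: returns a buffer positioned at the BOM, ready for reading.
    static ByteBuffer encode(String message) {
        CharsetEncoder encoder = StandardCharsets.UTF_16LE.newEncoder();
        // UTF-16LE uses exactly 2 bytes per Java char, plus 2 bytes for the BOM.
        ByteBuffer out = ByteBuffer.allocate(2 + 2 * message.length());
        out.put((byte) 0xFF).put((byte) 0xFE);
        CoderResult result = encoder.encode(CharBuffer.wrap(message), out, true);
        if (result.isError()) {
            // Assumption for this sketch: reject malformed input instead of replacing it.
            throw new IllegalArgumentException("Cannot encode input: " + result);
        }
        encoder.flush(out);
        out.flip();
        return out;
    }
}
```

The returned ByteBuffer can be handed to a channel directly. Because the buffer is heap-allocated at exactly the needed size, out.array() also yields the encoded bytes without a further copy when the input is well formed.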

Answer 4

    ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream(string.length() * 2 + 2);
    byteArrayOutputStream.write(new byte[]{(byte)0xFF,(byte)0xFE});
    byteArrayOutputStream.write(string.getBytes("UTF-16LE"));
    return byteArrayOutputStream.toByteArray();

EDIT: Rereading your question, I see you would rather avoid the double array allocation altogether. Unfortunately the API doesn't give you that, as far as I know. (There was a method, but it is deprecated, and you can't specify an encoding with it.)

I wrote the above before I saw your comment. I think the answer to use the nio classes is on the right track. I was looking at that, but I'm not familiar enough with the API to know offhand how to get that done.

Answer 5

This is an old question, but still, I couldn't find an acceptable answer for my situation. Basically, Java doesn't have a built-in encoder for UTF-16LE with a BOM, so you have to roll your own implementation.

Here's what I ended up with:

private byte[] encodeUTF16LEWithBOM(final String s) {
    ByteBuffer content = Charset.forName("UTF-16LE").encode(s);
    byte[] bom = { (byte) 0xff, (byte) 0xfe };
    // Size by remaining() rather than capacity() so no unused trailing bytes are included.
    return ByteBuffer.allocate(bom.length + content.remaining()).put(bom).put(content).array();
}
