How do I encode/decode UTF-16LE byte arrays with a BOM?

I need to encode/decode UTF-16 byte arrays to and from java.lang.String. The byte arrays are given to me with a byte order mark (BOM), and I need to produce encoded byte arrays with a BOM.

Also, because I'm dealing with a Microsoft client/server, I'd like to emit the encoding in little endian (along with the LE BOM) to avoid any misunderstandings. I do realize that with the BOM it should work in big endian, but I don't want to swim upstream in the Windows world.

As an example, here is a method that encodes a java.lang.String as UTF-16 in little endian with a BOM:

public static byte[] encodeString(String message) {

    byte[] tmp = null;
    try {
        tmp = message.getBytes("UTF-16LE");
    } catch(UnsupportedEncodingException e) {
        // should not be possible: every Java platform must support UTF-16LE
        AssertionError ae =
            new AssertionError("Could not encode UTF-16LE");
        ae.initCause(e);
        throw ae;
    }

    // use brute force method to add BOM
    byte[] utf16lemessage = new byte[2 + tmp.length];
    utf16lemessage[0] = (byte)0xFF;
    utf16lemessage[1] = (byte)0xFE;
    System.arraycopy(tmp, 0,
                     utf16lemessage, 2,
                     tmp.length);
    return utf16lemessage;
}

What is the best way to do this in Java? Ideally I'd like to avoid copying the entire byte array into a new byte array that has two extra bytes allocated at the beginning.

The same goes for decoding such a string, but that is much more straightforward with the java.lang.String constructor:

public String(byte[] bytes,
              int offset,
              int length,
              String charsetName)
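As a rough sketch of the decoding side, the helper below checks for the FF FE marker and uses the constructor's offset and length arguments to skip it. The class and method names are illustrative and not from the original post, and it uses the Charset overload of the constructor rather than the charset-name one shown above. It assumes that input without a little-endian BOM should still be treated as UTF-16LE, which may not suit every caller.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
final class Utf16LeDecoding {
    /** Decodes UTF-16LE bytes, skipping a leading FF FE byte order mark if one is present. */
    static String decode(byte[] bytes) {
        boolean hasBom = bytes.length >= 2
                && bytes[0] == (byte) 0xFF
                && bytes[1] == (byte) 0xFE;
        // Assumption: data without a BOM is still little endian.
        int offset = hasBom ? 2 : 0;
        return new String(bytes, offset, bytes.length - offset, StandardCharsets.UTF_16LE);
    }
}
```

Skipping the marker by offset avoids copying the input, and it keeps a stray U+FEFF character out of the decoded string, which is what decoding the whole array as plain UTF-16LE would leave behind.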

The "UTF-16" charset name will always encode with a BOM and will decode data in either big or little endian order, but "UnicodeBig" and "UnicodeLittle" are useful for encoding in a specific byte order. Use UTF-16LE or UTF-16BE for no BOM; see this post for how to use "\uFEFF" to handle BOMs manually. See here for the canonical naming of charset string names or (preferably) the Charset class. Also take note that only a limited subset of encodings is absolutely required to be supported.
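To make the difference between these names concrete, here is a small sketch that dumps the bytes each charset produces. The class and its two helpers are illustrative names, not part of any library. "UnicodeLittle" is not one of the charsets every Java platform must provide, so the sketch checks for it with Charset.isSupported instead of assuming it exists.

```java
import java.nio.charset.Charset;

// Hypothetical helper for illustration only.
final class BomBehaviour {
    /** Formats bytes as space-separated upper-case hex pairs, e.g. "FF FE". */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    /** Encodes text with the named charset, or returns null if this JVM lacks it. */
    static byte[] encodeIfSupported(String charsetName, String text) {
        if (!Charset.isSupported(charsetName)) {
            return null; // optional charsets such as "UnicodeLittle" may be absent
        }
        return text.getBytes(Charset.forName(charsetName));
    }
}
```

For the text "hi", hex(encodeIfSupported("UTF-16", "hi")) gives FE FF 00 68 00 69 (a BOM followed by big-endian data), while "UTF-16LE" gives 68 00 69 00 with no BOM. On a JDK that provides "UnicodeLittle", the result should start with FF FE, but that is worth verifying on the target runtime.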

This is how you do it in nio:

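    // Prepend U+FEFF so the encoder emits the BOM bytes FF FE itself;
    // put(0, ...) and put(1, ...) would overwrite the first character instead.
    ByteBuffer encoded = Charset.forName("UTF-16LE").encode("\uFEFF" + message);
    byte[] result = new byte[encoded.remaining()];
    encoded.get(result);
    return result;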

It is supposed to be faster, although I don't know how many arrays it creates under the covers. My understanding is that the point of the API is to minimize that.

First off, for decoding you can use the character set "UTF-16"; that automatically detects an initial BOM. For encoding UTF-16BE, you can also use the "UTF-16" character set: that will write a proper BOM and then output big-endian data.

For encoding to little endian with a BOM, I don't think your current code is too bad, even with the double allocation (unless your strings are truly monstrous). If they are, you might want to work with a java.nio ByteBuffer rather than a byte array, and use the java.nio.charset.CharsetEncoder class (which you can get from Charset.forName("UTF-16LE").newEncoder()).
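One way the CharsetEncoder idea could look is sketched below. The class name is illustrative. The sketch relies on UTF-16LE using exactly two bytes per Java char, so the output buffer can be sized up front with room for the BOM. It also assumes that unpaired surrogates should be replaced rather than rejected; a caller that wants strict validation would use CodingErrorAction.REPORT instead.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration only.
final class Utf16LeBomEncoder {
    /** Encodes s as UTF-16LE preceded by the FF FE byte order mark. */
    static byte[] encode(String s) {
        // Two bytes per char, plus two bytes for the BOM.
        ByteBuffer out = ByteBuffer.allocate(2 + 2 * s.length());
        out.put((byte) 0xFF).put((byte) 0xFE);

        CharsetEncoder encoder = StandardCharsets.UTF_16LE.newEncoder()
                // Assumption: malformed input is replaced, not reported.
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);

        // Encode straight into the buffer, right after the BOM.
        CoderResult result = encoder.encode(CharBuffer.wrap(s), out, true);
        if (result.isUnderflow()) {
            result = encoder.flush(out);
        }
        if (!result.isUnderflow()) {
            throw new IllegalStateException("Unexpected encoder result: " + result);
        }

        // The buffer is normally exactly full, so its backing array is returned as-is.
        return out.position() == out.capacity()
                ? out.array()
                : Arrays.copyOf(out.array(), out.position());
    }
}
```

Because the encoder writes directly into a buffer that already holds the BOM, the encoded bytes are not copied into a second array in the usual case. Whether that matters in practice depends on how large the strings are.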

    ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream(string.length() * 2 + 2);
    byteArrayOutputStream.write(new byte[]{(byte)0xFF,(byte)0xFE});
    byteArrayOutputStream.write(string.getBytes("UTF-16LE"));
    return byteArrayOutputStream.toByteArray();

EDIT: Rereading your question, I see you would rather avoid the double array allocation altogether. Unfortunately the API doesn't give you that, as far as I know. (There was a method, but it is deprecated, and you can't specify an encoding with it.)

I wrote the above before I saw your comment. I think the answer that uses the nio classes is on the right track. I was looking at that, but I'm not familiar enough with the API to know offhand how to get that done.

This is an old question, but I still couldn't find an acceptable answer for my situation. Basically, none of the charsets that every Java platform is required to support encodes UTF-16LE with a BOM, so you have to roll your own implementation.

Here's what I ended up with:

private byte[] encodeUTF16LEWithBOM(final String s) {
    ByteBuffer content = Charset.forName("UTF-16LE").encode(s);
    byte[] bom = { (byte) 0xff, (byte) 0xfe };
    // remaining() counts only the encoded bytes; capacity() may include unused space
    return ByteBuffer.allocate(content.remaining() + bom.length).put(bom).put(content).array();
}
