
Passing double-byte (WCHAR) strings from C++ to Java via JNI

I have a Java application that uses a C++ DLL via JNI. A few of the DLL's methods take string arguments, and some of them return objects that contain strings as well.

Currently the DLL does not support Unicode, so the string handling is rather easy:

  • Java calls String.getBytes() and passes the resulting array to the DLL, which simply treats the data as a char*.
  • DLL uses NewStringUTF() to create a jstring from a const char*.

I'm now in the process of modifying the DLL to support Unicode, switching to the TCHAR type (which, when UNICODE is defined, maps to Windows' WCHAR datatype). Modifying the DLL is going well, but I'm not sure how to modify the JNI portion of the code.

The only thing I can think of right now is this:

  • Java calls String.getBytes(String charsetName) and passes the resulting array to the DLL, which treats the data as a wchar_t*.
  • DLL no longer creates Strings, but instead passes jbyteArrays containing the raw string data. Java uses the String(byte[] bytes, String charsetName) constructor to actually create the String.

The only problem with this method is that I'm not sure what charset name to use. WCHARs are 2 bytes long, so I'm pretty sure it's UTF-16, but there are 3 possibilities on the Java side: UTF-16, UTF-16BE, and UTF-16LE. I haven't found any documentation that tells me what the byte order is, but I can probably figure it out from some quick testing.

Is there a better way? If possible I'd like to continue constructing the jstring objects within the DLL, as that way I won't have to modify any of the usages of those methods. However, the NewString JNI method doesn't take a charset identifier.

This answer suggests that the byte ordering of WCHARs is not guaranteed...

Since you are on Windows you could try WideCharToMultiByte to convert the WCHARs to UTF-8 and then use your existing JNI code.

You will need to be careful using WideCharToMultiByte due to the possibility of buffer overruns in the lpMultiByteStr parameter. To get round this you should call the function twice: first with lpMultiByteStr set to NULL and cbMultiByte set to zero - this returns the length of the required lpMultiByteStr buffer without attempting to write to it. Once you have the length you can allocate a buffer of the required size and call the function again.

Example code:

int utf8_length;

wchar_t* utf16 = ...;

utf8_length = WideCharToMultiByte(
  CP_UTF8,           // Convert to UTF-8
  0,                 // No special character conversions required 
                     // (UTF-16 and UTF-8 support the same characters)
  utf16,             // UTF-16 string to convert
  -1,                // utf16 is NULL terminated (if not, use length)
  NULL,              // Determining correct output buffer size
  0,                 // Determining correct output buffer size
  NULL,              // Must be NULL for CP_UTF8
  NULL);             // Must be NULL for CP_UTF8

if (utf8_length == 0) {
  // Error - call GetLastError for details
}

char* utf8 = (char*)malloc(utf8_length); // Allocate the UTF-8 buffer; the size
                                         // returned by the first call includes the
                                         // terminating NUL because cchWideChar was -1

utf8_length = WideCharToMultiByte(
  CP_UTF8,           // Convert to UTF-8
  0,                 // No special character conversions required 
                     // (UTF-16 and UTF-8 support the same characters)
  utf16,             // UTF-16 string to convert
  -1,                // utf16 is NULL terminated (if not, use length)
  utf8,              // UTF-8 output buffer
  utf8_length,       // UTF-8 output buffer size
  NULL,              // Must be NULL for CP_UTF8
  NULL);             // Must be NULL for CP_UTF8

if (utf8_length == 0) {
  // Error - call GetLastError for details
}

I found a little FAQ about the byte order mark. Also from that FAQ:

UTF-16 and UTF-32 use code units that are two and four bytes long respectively. For these UTFs, there are three sub-flavors: BE, LE and unmarked. The BE form uses big-endian byte serialization (most significant byte first), the LE form uses little-endian byte serialization (least significant byte first) and the unmarked form uses big-endian byte serialization by default, but may include a byte order mark at the beginning to indicate the actual byte serialization used.

I'm assuming that on the Java side the UTF-16 charset will try to find this BOM and properly deal with the encoding. We all know how dangerous assumptions can be...

Edit because of comment:

Microsoft uses UTF-16 little-endian. Java's UTF-16 charset tries to interpret the BOM; when the BOM is missing it defaults to UTF-16BE. The BE and LE variants ignore the BOM.
