wchar_t* to char* conversion problems

I have a problem with wchar_t* to char* conversion.

I'm getting a wchar_t* string from the FILE_NOTIFY_INFORMATION structure, returned by the ReadDirectoryChangesW WinAPI function, so I assume that the string is correct.

Assume that the wide string is "New Text File.txt". In the Visual Studio debugger, hovering over the variable shows "N" followed by some unknown Chinese letters, though the Watch window displays the string correctly.

When I try to convert wchar_t to char with wcstombs

wcstombs(pfileName, pwfileName, fileInfo.FileNameLength);

it converts just two letters to char* ("Ne") and then generates an error.

It is an internal error in wcstombs.c, in the function _wcstombs_l_helper(), at this block:

if (*pwcs > 255)  /* validate high byte */
{
    errno = EILSEQ;
    return (size_t)-1;  /* error */
}

It isn't thrown as an exception.

What can be the problem?

In order to do what you're trying to do The Right Way, there are several nontrivial things that you need to take into account. I'll do my best to break them down for you here.

Let's start with the definition of the count parameter from the wcstombs() function's documentation on MSDN:

The maximum number of bytes that can be stored in the multibyte output string.

Note that this does NOT say anything about the number of wide characters in the wide character input string. Even though all of the wide characters in your example input string ("New Text File.txt") can be represented as single-byte ASCII characters, we cannot assume that every wide character will generate exactly one byte of output for every possible input string (if this statement confuses you, you should check out Joel's article on Unicode and character sets). So, if you pass wcstombs() the size of the output buffer, how does it know how long the input string is? The documentation states that the input string is expected to be null-terminated, as per the standard C language convention:

If wcstombs encounters the wide-character null character (L'\0') either before or when count occurs, it converts it to an 8-bit 0 and stops.

Though this isn't explicitly stated in the documentation, we can infer that if the input string isn't null-terminated, wcstombs() will keep reading wide characters until it has written count bytes to the output string. So if you're dealing with a wide character string that isn't null-terminated, it isn't enough to just know how long the input string is; you would have to somehow know exactly how many bytes the output string would need to be (which is impossible to determine without doing the conversion) and pass that as the count parameter to make wcstombs() do what you want it to do.
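To make the meaning of count concrete, here is a minimal sketch that is not part of the original code. The helper name convertAtMost is hypothetical, and the sketch assumes the default "C" locale and ASCII-only input, so each wide character happens to become exactly one byte.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>   // std::wcstombs
#include <string>
#include <vector>

// Hypothetical helper: converts a null-terminated wide string, letting
// wcstombs() write at most `count` bytes of output.
std::string convertAtMost(const wchar_t* src, std::size_t count) {
    assert(src != nullptr);
    std::vector<char> out(count + 1, '\0');   // +1 so the buffer is never empty
    const std::size_t written = std::wcstombs(out.data(), src, count);
    if (written == static_cast<std::size_t>(-1)) {
        return std::string();                 // a character could not be converted
    }
    return std::string(out.data(), written);  // wcstombs() may not have null-terminated
}
```

With L"New Text File.txt" as input, a count of 3 produces "New", because count limits the output bytes. A count of 64 produces the whole name, because the conversion stops at the L'\0' terminator rather than at count.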

Why am I focusing so much on this null-termination issue? Because the FILE_NOTIFY_INFORMATION structure's documentation on MSDN has this to say about its FileName field:

A variable-length field that contains the file name relative to the directory handle. The file name is in the Unicode character format and is not null-terminated.

The fact that the FileName field isn't null-terminated explains why it has a bunch of "unknown Chinese letters" at the end of it when you look at it in the debugger. The FILE_NOTIFY_INFORMATION structure's documentation also contains another nugget of wisdom regarding the FileNameLength field:

The size of the file name portion of the record, in bytes.

Note that this says bytes, not characters. Therefore, even if you wanted to assume that each wide character in the input string will generate exactly one byte in the output string, you shouldn't be passing fileInfo.FileNameLength for count; you should be passing fileInfo.FileNameLength / sizeof(WCHAR) (or use a null-terminated input string, of course). Putting all of this information together, we can finally understand why your original call to wcstombs() was failing: it was reading past the end of the string and choking on invalid data (thereby triggering the EILSEQ error).

Now that we've elucidated the problem, it's time to talk about a possible solution. In order to do this The Right Way, the first thing you need to know is how big your output buffer needs to be. Luckily, there is one final tidbit in the documentation for wcstombs() that will help us out here:

If the mbstr argument is NULL, wcstombs returns the required size in bytes of the destination string.

So the idiomatic way to use the wcstombs() function is to call it twice: the first time to determine how big your output buffer needs to be, and the second time to actually do the conversion. The final thing to note is that, as we stated previously, the wide character input string needs to be null-terminated for at least the first call to wcstombs().

Putting this all together, here is a snippet of code that does what you are trying to do:

size_t fileNameLengthInWChars = fileInfo.FileNameLength / sizeof(WCHAR); //get the length of the filename in characters
WCHAR *pwNullTerminatedFileName = new WCHAR[fileNameLengthInWChars + 1]; //allocate an intermediate buffer to hold a null-terminated version of fileInfo.FileName; +1 for null terminator
wcsncpy(pwNullTerminatedFileName, fileInfo.FileName, fileNameLengthInWChars); //copy the filename into the intermediate buffer
pwNullTerminatedFileName[fileNameLengthInWChars] = L'\0'; //null terminate the new buffer
size_t fileNameLengthInChars = wcstombs(NULL, pwNullTerminatedFileName, 0); //first call to wcstombs() determines how long the output buffer needs to be
char *pFileName = new char[fileNameLengthInChars + 1]; //allocate the final output buffer; +1 to leave room for null terminator
wcstombs(pFileName, pwNullTerminatedFileName, fileNameLengthInChars + 1); //finally do the conversion!

Of course, don't forget to call delete[] pwNullTerminatedFileName and delete[] pFileName to clean up when you're done with them.
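If you would rather not manage the two buffers by hand, the same two-call idiom can be sketched with std::wstring and std::string, which release their memory automatically. This is only a sketch under some assumptions: narrowFileName is a hypothetical name, the name pointer is expected to be fileInfo.FileName (WCHAR is a typedef for wchar_t on Windows), and byteLength is expected to be fileInfo.FileNameLength. MSVC may also warn that wcstombs() is deprecated in favor of wcstombs_s().

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>     // std::wcstombs
#include <stdexcept>
#include <string>

// Hypothetical helper: converts a wide file name that is NOT null-terminated
// and whose length is given in bytes.
std::string narrowFileName(const wchar_t* name, std::size_t byteLength) {
    assert(name != nullptr);
    // Copy exactly the valid characters; c_str() then yields a null-terminated view.
    const std::wstring wide(name, byteLength / sizeof(wchar_t));

    // First call: ask how many output bytes are needed (terminator not included).
    const std::size_t needed = std::wcstombs(nullptr, wide.c_str(), 0);
    if (needed == static_cast<std::size_t>(-1)) {
        throw std::runtime_error("file name cannot be represented in the current locale");
    }

    // Second call: convert into a string of exactly that size.
    std::string narrow(needed, '\0');
    std::wcstombs(&narrow[0], wide.c_str(), needed);
    return narrow;
}
```

A call would look like narrowFileName(fileInfo.FileName, fileInfo.FileNameLength). Any garbage after the valid characters is never read, because only byteLength / sizeof(wchar_t) characters are copied into the intermediate string.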

ONE LAST THING

After writing this answer, I reread your question a bit more closely and thought of another mistake you may be making. You say that wcstombs() fails after just converting the first two letters ("Ne"), which means that it's hitting uninitialized data in the input string after the first two wide characters. Did you happen to use the assignment operator to copy one FILE_NOTIFY_INFORMATION variable to another? For example,

FILE_NOTIFY_INFORMATION fileInfo = someOtherFileInfo;

If you did this, it would only copy the first two wide characters of someOtherFileInfo.FileName to fileInfo.FileName. In order to understand why this is the case, consider the declaration of the FILE_NOTIFY_INFORMATION structure:

typedef struct _FILE_NOTIFY_INFORMATION {
  DWORD NextEntryOffset;
  DWORD Action;
  DWORD FileNameLength;
  WCHAR FileName[1];
} FILE_NOTIFY_INFORMATION, *PFILE_NOTIFY_INFORMATION;

When the compiler generates code for the assignment operation, it doesn't understand the trickery that is being pulled with FileName being a variable-length field, so it just copies sizeof(FILE_NOTIFY_INFORMATION) bytes from someOtherFileInfo to fileInfo. Since FileName is declared as an array of one WCHAR, you would think that only one character would be copied, but the compiler pads the struct with two extra bytes (so that its size is a multiple of the 4-byte alignment of its DWORD members), which is why a second WCHAR is copied as well.
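The size arithmetic can be sketched as follows. NotifyRecord is a hypothetical stand-in that mirrors the layout of FILE_NOTIFY_INFORMATION with fixed-width types so that it compiles without windows.h; recordSize and copyRecord are likewise illustrative names, not WinAPI functions.

```cpp
#include <cassert>
#include <cstddef>   // offsetof, std::size_t
#include <cstdint>
#include <cstring>   // std::memcpy
#include <vector>

// Hypothetical stand-in with the same layout as FILE_NOTIFY_INFORMATION.
struct NotifyRecord {
    std::uint32_t NextEntryOffset;
    std::uint32_t Action;
    std::uint32_t FileNameLength;   // in bytes, not characters
    char16_t      FileName[1];      // 16-bit units, like WCHAR
};

// Fixed header bytes plus the full variable-length name.
std::size_t recordSize(std::uint32_t fileNameLengthInBytes) {
    return offsetof(NotifyRecord, FileName) + fileNameLengthInBytes;
}

// Copies one whole record, name included, out of a notification buffer.
std::vector<unsigned char> copyRecord(const void* src) {
    assert(src != nullptr);
    NotifyRecord header{};
    // Read only the fixed part; the name may be shorter than the padded struct.
    std::memcpy(&header, src, offsetof(NotifyRecord, FileName));
    const unsigned char* bytes = static_cast<const unsigned char*>(src);
    return std::vector<unsigned char>(bytes, bytes + recordSize(header.FileNameLength));
}
```

For a 17-character name (34 bytes), recordSize returns 46, whereas sizeof(NotifyRecord) is only 16 on typical compilers, and 16 bytes is all that a plain struct assignment copies.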

My guess is that the wide string that you are passing is invalid or incorrectly defined.

How is pwFileName defined? It seems you have a FILE_NOTIFY_INFORMATION structure defined as fileInfo, so why are you not using fileInfo.FileName, as shown below?

wcstombs(pfileName, fileInfo.FileName, fileInfo.FileNameLength);

The error you get says it all: wcstombs found a character that it cannot convert to a multibyte character (because it has no multibyte representation). Source:

If wcstombs encounters a wide character it cannot convert to a multibyte character, it returns -1 cast to type size_t and sets errno to EILSEQ.
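Since the failure is reported through the return value and errno rather than an exception, it has to be checked explicitly. A minimal sketch, assuming the program is still in the default "C" locale (where a character such as U+4E2D has no multibyte representation); conversionError is a hypothetical helper name:

```cpp
#include <cassert>
#include <cerrno>    // errno, EILSEQ
#include <cstddef>
#include <cstdlib>   // std::wcstombs

// Hypothetical helper: returns 0 if the null-terminated wide string can be
// converted in the current locale, otherwise the errno value set by wcstombs().
int conversionError(const wchar_t* wide) {
    assert(wide != nullptr);
    errno = 0;
    // A null destination only measures the output; nothing is written.
    if (std::wcstombs(nullptr, wide, 0) == static_cast<std::size_t>(-1)) {
        return errno;
    }
    return 0;
}
```

Running this on the actual string from the notification buffer, instead of an assumed one, shows directly whether the data handed to wcstombs() contains characters that cannot be converted.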

In cases like this you should avoid 'assumed' input, and give an actual test case that fails.
