How to convert a UTF-8 string to wchars in iOS?

I have a Win32 function which I need to port to iOS:

// Loads UTF-8 file and converts to a UTF-16 string

bool LoadUTF8File(char const *filename, wstring &str)
{
    size_t size;
    bool rc = false;
    void *bytes = LoadFile(filename, &size);
    if(bytes != 0)
    {
        int len = MultiByteToWideChar(CP_UTF8, 0, (LPCCH)bytes, size, 0, 0);
        if(len > 0)
        {
            str.resize(len + 1);
            MultiByteToWideChar(CP_UTF8, 0, (LPCCH)bytes, size, &str[0], len);
            str[len] = '\0';
            rc = true;
        }
        delete[] bytes;
    }
    return rc;
}

// LoadFile returns the loaded file as a block of memory
// There is a 3 byte BOM which MultiByteToWideChar seems to ignore
// The text in the file is encoded as UTF-8

I'm using C++ for this rather than Objective-C, and I've been trying mbstowcs and _mbstowcs_l. They don't seem to behave the same way as MultiByteToWideChar. For example, the accented character at the end of the word attaché is not converted correctly (the Win32 version converts it correctly). Is there a 'UTF-8 to UTF-16' function somewhere in the standard libraries?

Does the Win32 version have a bug in it which I'm not noticing?

The length returned from MultiByteToWideChar is less than the length returned from mbstowcs.

Weirdly, in this small test case

    const char *p = "attaché";

    wstring str;
    size_t size = strlen(p);

    // CRT conversion, using the current locale's multibyte encoding
    setlocale(LC_ALL, "");
    int len = mbstowcs(nullptr, p, size);
    if(len > 0)
    {
        str.resize(len + 1);
        mbstowcs(&str[0], p, size);
        str[len] = '\0';
    }
    TRACE(L"%s\n", str.c_str());

    // Win32 conversion, treating the bytes as UTF-8
    len = MultiByteToWideChar(CP_UTF8, 0, p, size, nullptr, 0);
    if(len > 0)
    {
        str.resize(len + 1);
        MultiByteToWideChar(CP_UTF8, 0, p, size, &str[0], len);
        str[len] = '\0';
    }
    TRACE(L"%s\n", str.c_str());

I get the correct output from mbstowcs, while MultiByteToWideChar erroneously converts the last character into 65533 (U+FFFD, REPLACEMENT CHARACTER). Now I'm confused...

Are you stuck with using C++ for this, or is it just the way you have chosen so far, and you are open to doing it in Objective-C too?

In Objective-C you can use [yourUTF8String dataUsingEncoding:NSUTF16StringEncoding] to get an NSData containing the bytes of the UTF-16 representation of the string.
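
If staying in C++ is preferred, one option is to decode the UTF-8 bytes by hand. The following is only a minimal sketch, not a drop-in replacement for MultiByteToWideChar: the function name utf8_to_utf16 is hypothetical, and the result is a std::u16string rather than a wstring because wchar_t is 32 bits wide on Apple platforms, so a wstring there would not hold UTF-16 code units. The sketch assumes that a leading BOM should be skipped and that malformed input should become U+FFFD; the exact number of replacement characters emitted for a bad sequence may differ from what Win32 produces.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not from the original post): decodes UTF-8 bytes into
// UTF-16 code units. Malformed sequences are replaced with U+FFFD.
std::u16string utf8_to_utf16(const char *bytes, std::size_t size)
{
    const unsigned char *p = reinterpret_cast<const unsigned char *>(bytes);
    std::size_t i = 0;

    // Skip a leading UTF-8 byte order mark (EF BB BF), if present.
    if (size >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF)
        i = 3;

    std::u16string out;
    out.reserve(size - i);
    while (i < size)
    {
        unsigned char lead = p[i++];
        std::uint32_t cp;   // code point being assembled
        std::uint32_t min;  // smallest code point valid for this length
        int extra;          // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; min = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; min = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; min = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; min = 0x10000; }
        else { out.push_back(0xFFFD); continue; } // stray continuation or invalid lead

        bool ok = true;
        for (int k = 0; k < extra; ++k)
        {
            // A missing or invalid continuation byte ends the sequence early;
            // the offending byte is left in place and examined again as a lead.
            if (i >= size || (p[i] & 0xC0) != 0x80) { ok = false; break; }
            cp = (cp << 6) | (p[i++] & 0x3F);
        }

        // Reject truncated, overlong, surrogate, and out-of-range values.
        if (!ok || cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        {
            out.push_back(0xFFFD);
            continue;
        }

        if (cp < 0x10000)
        {
            out.push_back(static_cast<char16_t>(cp));
        }
        else
        {
            // Code points above the BMP become a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

With the memory block from LoadFile this could be called as utf8_to_utf16(static_cast<const char *>(bytes), size). Feeding it a lone E9 byte (the Latin-1 encoding of "é") yields U+FFFD, the value 65533 reported in the question.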


Additional hypothesis: the "é" character that is not converted correctly in your example may also be explained by your solution not taking the normalization form (NFD or NFC) into account. If the "é" is encoded in NFD form (the character 'e' followed by a combining acute accent), it may not be interpreted correctly, whereas in NFC form (the precomposed accented "é" character) it will be. Or vice versa.

That's just one hypothesis; it really depends on what result you get in place of the "é" character you expect, but it's worth checking.
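
One way to check is to dump the raw bytes of the input and compare them with the two forms. The sketch below assumes the data is available as a std::string; hex_bytes, e_nfc and e_nfd are hypothetical names introduced only for illustration.

```cpp
#include <string>

// Hypothetical helper: formats each byte of a string as two uppercase hex
// digits, separated by spaces.
std::string hex_bytes(const std::string &s)
{
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char c : s)
    {
        if (!out.empty())
            out += ' ';
        out += digits[c >> 4];
        out += digits[c & 0x0F];
    }
    return out;
}

// UTF-8 encodings of "é" in the two normalization forms.
const std::string e_nfc = "\xC3\xA9";   // U+00E9, precomposed
const std::string e_nfd = "e\xCC\x81";  // U+0065 followed by U+0301
```

Here hex_bytes(e_nfc) gives "C3 A9" and hex_bytes(e_nfd) gives "65 CC 81". If the dump of the real input instead ends in a lone E9, the text is probably not UTF-8 at all (for example Latin-1 or Windows-1252), which would be a different problem from normalization.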
