How to Convert UTF-16 to UTF-32 and Print the Resulting wchar_t in C?

I'm trying to print a string of UTF-16 characters. I posted this question a while back, and the advice given was to convert to UTF-32 using iconv and print the result as a string of wchar_t.

I've done some research and managed to write the following code:

// *c is the pointer to the characters (UTF-16) i'm trying to print
// sz is the size in bytes of the input i'm trying to print

iconv_t icv;
char in_buf[sz];
char* in;
size_t in_sz;
char out_buf[sz * 2];
char* out;
size_t out_sz;

icv = iconv_open("UTF-32", "UTF-16");

memcpy(in_buf, c, sz);

in = in_buf;
in_sz = sz;
out = out_buf;
out_sz = sz * 2;

size_t ret = iconv(icv, &in, &in_sz, &out, &out_sz);
printf("ret = %d\n", ret);
printf("*** %ls ***\n", ((wchar_t*) out_buf));

The iconv call always returns 0, so I guess the conversion should be OK?

However, printing seems to be hit and miss. At times the converted wchar_t string prints OK. Other times, it seems to hit a problem while printing the wchar_t and terminates the printf call altogether, so that even the trailing "***" does not get printed.

I also tried using

wprintf(((wchar_t*) "*** %ls ***\n"), out_buf));

but nothing ever gets printed.

Am I missing something here?

Reference: How to Print UTF-16 Characters in C?

UPDATE

I incorporated some of the suggestions from the comments.

Updated code:

// *c is the pointer to the characters (UTF-16) i'm trying to print
// sz is the size in bytes of the input i'm trying to print

iconv_t icv;
char in_buf[sz];
char* in;
size_t in_sz;
wchar_t out_buf[sz / 2];
char* out;
size_t out_sz;

icv = iconv_open("UTF-32", "UTF-16");

memcpy(in_buf, c, sz);

in = in_buf;
in_sz = sz;
out = (char*) out_buf;
out_sz = sz * 2;

size_t ret = iconv(icv, &in, &in_sz, &out, &out_sz);
printf("ret = %d\n", ret);
printf("*** %ls ***\n", out_buf);
wprintf(L"*** %ls ***\n", out_buf);

Still the same result: not all of the UTF-16 strings get printed (by either the printf or the wprintf call).

What else could I be missing?

By the way, I'm using Linux, and have verified that wchar_t is 4 bytes.

Here is a short program that converts UTF-16 to a wide character array and then prints it out.

#include <endian.h>
#include <errno.h>
#include <iconv.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

#define FROMCODE "UTF-16"

#if (BYTE_ORDER == LITTLE_ENDIAN)
#define TOCODE "UTF-32LE"
#elif (BYTE_ORDER == BIG_ENDIAN)
#define TOCODE "UTF-32BE"
#else
#error Unsupported byte order
#endif

int main(void)
{
    void *tmp;
    char *outbuf;
    const char *inbuf;
    long converted = 0;
    wchar_t *out = NULL;
    int status = EXIT_SUCCESS;
    size_t inbytesleft, outbytesleft, size, n; /* iconv returns size_t */
    const char in[] = {
        0xff, 0xfe,
        'H', 0x0,
        'e', 0x0,
        'l', 0x0,
        'l', 0x0,
        'o', 0x0,
        ',', 0x0,
        ' ', 0x0,
        'W', 0x0,
        'o', 0x0,
        'r', 0x0,
        'l', 0x0,
        'd', 0x0,
        '!', 0x0
    };
    iconv_t cd = iconv_open(TOCODE, FROMCODE);
    if ((iconv_t)-1 == cd) {
        if (EINVAL == errno) {
            fprintf(stderr, "iconv: cannot convert from %s to %s\n",
                    FROMCODE, TOCODE);
        } else {
            fprintf(stderr, "iconv: %s\n", strerror(errno));
        }
        goto error;
    }
    size = sizeof(in) * sizeof(wchar_t);
    inbuf = in;
    inbytesleft = sizeof(in);
    while (1) {
        tmp = realloc(out, size + sizeof(wchar_t));
        if (!tmp) {
            fprintf(stderr, "realloc: %s\n", strerror(errno));
            goto error;
        }
        out = tmp;
        outbuf = (char *)out + converted;
        outbytesleft = size - converted;
        n = iconv(cd, (char **)&inbuf, &inbytesleft, &outbuf, &outbytesleft);
        if ((size_t)-1 == n) {
            if (EINVAL == errno) {
                /* junk at the end of the buffer, ignore it */
                break;
            } else if (E2BIG != errno) {
                /* unrecoverable error */
                fprintf(stderr, "iconv: %s\n", strerror(errno));
                goto error;
            }
            /* increase the size of the output buffer */
            converted = size - outbytesleft;
            size <<= 1;
        } else {
            /* done */
            break;
        }
    }
    converted = (size - outbytesleft) / sizeof(wchar_t);
    out[converted] = L'\0';
    fprintf(stdout, "%ls\n", out);
    /* flush the iconv buffer */
    iconv(cd, NULL, NULL, &outbuf, &outbytesleft);
exit:
    if (out) {
        free(out);
    }
    /* iconv_open reports failure as (iconv_t)-1, not NULL */
    if ((iconv_t)-1 != cd) {
        iconv_close(cd);
    }
    exit(status);
error:
    status = EXIT_FAILURE;
    goto exit;
}
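
One thing the program above does not show, and which may explain the "hit and miss" printing in the question: `%ls` converts each wide character to the multibyte encoding of the current locale, and a C program starts in the "C" locale unless it calls `setlocale`. The sketch below is only an illustration under that assumption; `printable_in_locale` is a hypothetical helper name, and passing a null destination to `wcstombs` to get the required length is POSIX behavior.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: returns 1 if every character in ws can be
 * converted to the current locale's multibyte encoding, which is the
 * conversion printf("%ls") has to perform; returns 0 otherwise. */
int printable_in_locale(const wchar_t *ws)
{
    assert(ws != NULL);
    /* With a null destination, POSIX wcstombs only reports the length
     * needed, or (size_t)-1 if some character cannot be converted. */
    return wcstombs(NULL, ws, 0) != (size_t)-1;
}
```

Calling `setlocale(LC_ALL, "")` at the start of `main` (with a UTF-8 locale configured in the environment) is the usual way to make non-ASCII wide characters printable; if the check above fails, `printf` is likely to stop at the first character it cannot convert.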

Since UTF-16 is a variable-length encoding, you are guessing how big your output buffer needs to be. A correct program should handle the case where the output buffer isn't large enough to hold the converted data.
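
To make the "variable-length" point concrete, here is a minimal sketch, not taken from the original answer. Both function names are hypothetical. The first counts code points in little-endian UTF-16, treating a surrogate pair as one code point; the second gives a conservative upper bound for the UTF-32 output, assuming the converter may also write a 4-byte byte order mark and that you want room for a 4-byte terminator.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: counts code points in a UTF-16LE buffer.
 * A surrogate pair (two 16-bit units) counts as one code point. */
size_t utf16le_code_points(const unsigned char *in, size_t in_bytes)
{
    size_t count = 0;
    size_t i = 0;

    assert(in != NULL || in_bytes == 0);
    while (i + 1 < in_bytes) {
        unsigned unit = in[i] | ((unsigned)in[i + 1] << 8);
        i += 2;
        /* high surrogate followed by a low surrogate: consume both */
        if (unit >= 0xD800 && unit <= 0xDBFF && i + 1 < in_bytes) {
            unsigned next = in[i] | ((unsigned)in[i + 1] << 8);
            if (next >= 0xDC00 && next <= 0xDFFF)
                i += 2;
        }
        count++;
    }
    return count;
}

/* Hypothetical helper: upper bound in bytes for the UTF-32 output.
 * Each 16-bit unit yields at most one 4-byte code point; two extra
 * slots cover a possible byte order mark and a terminator. */
size_t utf32_worst_case_bytes(size_t utf16_bytes)
{
    size_t units = utf16_bytes / 2;
    return (units + 2) * 4;
}
```

Sizing the buffer from the upper bound avoids the guess, but handling E2BIG as the program above does is still the safer habit.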

You should also note that iconv doesn't null-terminate your output buffer for you.
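
A minimal sketch of terminating the output yourself, under a few assumptions: `utf16le_to_wcs` is a hypothetical helper, wchar_t is 4 bytes (as the question states for Linux), the input is UTF-16LE without a byte order mark, and `<endian.h>` is available as in the program above. It reserves one slot for the terminator and works out how many wide characters iconv actually wrote.

```c
#include <assert.h>
#include <endian.h>
#include <iconv.h>
#include <stddef.h>
#include <wchar.h>

#if (BYTE_ORDER == LITTLE_ENDIAN)
#define WCS_CODE "UTF-32LE"
#else
#define WCS_CODE "UTF-32BE"
#endif

/* Hypothetical helper: converts UTF-16LE bytes into out[] and
 * null-terminates the result. Returns the number of wide characters
 * written (terminator excluded), or (size_t)-1 on any failure,
 * including an output buffer that is too small. */
size_t utf16le_to_wcs(const char *in, size_t in_bytes,
                      wchar_t *out, size_t out_cap)
{
    assert(sizeof(wchar_t) == 4); /* assumption stated in the question */
    if (out_cap == 0)
        return (size_t)-1;

    iconv_t cd = iconv_open(WCS_CODE, "UTF-16LE");
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *inp = (char *)in;   /* iconv wants char **, input is not modified */
    char *outp = (char *)out;
    size_t total = (out_cap - 1) * sizeof(wchar_t); /* keep one slot free */
    size_t out_left = total;

    size_t rc = iconv(cd, &inp, &in_bytes, &outp, &out_left);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return (size_t)-1;

    size_t written = (total - out_left) / sizeof(wchar_t);
    out[written] = L'\0';     /* iconv never does this for you */
    return written;
}
```

The key line is the one that derives `written` from the bytes consumed in the output buffer; without the terminator, `%ls` keeps reading past the converted data.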

iconv is a stream-oriented processor, so you need to flush the iconv_t if you want to reuse it for another conversion (the sample code does this near the end). If you want to do stream processing, you would handle the EINVAL error by copying any bytes left in the input buffer to the beginning of the new input buffer before calling iconv again.
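
The stream-processing idea can be sketched roughly as follows. This is an illustration only: `struct stream` and `stream_feed` are hypothetical names, chunks are assumed to be small enough to fit the fixed scratch buffer, and E2BIG is treated as a plain error to keep the example short.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical streaming state: the converter plus any bytes of an
 * incomplete sequence left over from the previous chunk. */
struct stream {
    iconv_t cd;
    char pending[8];
    size_t pending_len;
};

/* Feeds one chunk to the converter. Returns the number of bytes
 * written to out, or (size_t)-1 on an unrecoverable error. */
size_t stream_feed(struct stream *s, const char *chunk, size_t chunk_len,
                   char *out, size_t out_cap)
{
    char buf[64];

    /* leftover bytes go first, then the new chunk */
    assert(s->pending_len + chunk_len <= sizeof buf);
    memcpy(buf, s->pending, s->pending_len);
    memcpy(buf + s->pending_len, chunk, chunk_len);

    char *inp = buf;
    size_t in_left = s->pending_len + chunk_len;
    char *outp = out;
    size_t out_left = out_cap;

    size_t rc = iconv(s->cd, &inp, &in_left, &outp, &out_left);
    if (rc == (size_t)-1 && errno != EINVAL)
        return (size_t)-1;
    if (in_left > sizeof s->pending)
        return (size_t)-1;

    /* on EINVAL, inp points at the incomplete sequence: keep it */
    memcpy(s->pending, inp, in_left);
    s->pending_len = in_left;
    return out_cap - out_left;
}
```

A caller would open the descriptor once with `iconv_open`, call `stream_feed` for each chunk as it arrives, and close the descriptor with `iconv_close` at the end.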
