wchar_t to octets - in C?

I'm trying to store a wchar_t string as octets, but I'm fairly sure I'm doing it wrong. Would anybody mind validating my attempt? And what happens when one character takes up 4 bytes?

  unsigned int i;
  const wchar_t *wchar1 = L"abc";
  wprintf(L"%ls\r\n", wchar1);

  for (i=0;i< wcslen(wchar1);i++) {
    printf("(%d)", (wchar1[i]) & 255);
    printf("(%d)", (wchar1[i] >> 8) & 255);
  }

Unicode text is always encoded. Popular encodings are UTF-8, UTF-16 and UTF-32. Only the last of these uses a fixed size per code point. UTF-16 uses surrogate pairs for code points in the upper planes, so such a code point takes 2 wchar_t where wchar_t is 16 bits wide (as on Windows). UTF-8 is byte oriented and uses between 1 and 4 bytes to encode a code point.
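To make the "1 to 4 bytes" rule concrete, here is a minimal sketch of a UTF-8 encoder for a single code point. The helper name utf8_encode is made up for illustration and is not part of any standard library; the bit layout follows the UTF-8 definition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-8 (hypothetical helper).
 * Writes 1 to 4 octets into out and returns how many were written,
 * or 0 if cp is a surrogate or lies above U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {                        /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                       /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)       /* surrogates are not characters */
        return 0;
    if (cp < 0x10000) {                     /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                   /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                               /* outside the Unicode range */
}
```

For example, U+00F6 (ö) comes out as the two octets C3 B6, while a code point above U+FFFF needs all four.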

UTF-8 is an excellent choice if you need to transcode the text to a byte-oriented stream; it is a very common choice for text files and HTML on the Internet. On Windows you can use WideCharToMultiByte() with CodePage = CP_UTF8. A good alternative is the ICU library.
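If neither of those is available and you transcode by hand, the first step is recovering whole code points from the wchar_t string. The sketch below is one way to do that under the assumption that the string holds UTF-16 when wchar_t is 16 bits wide and UTF-32 otherwise. The function name next_code_point is hypothetical, and the caller must make sure *pos is below the string length.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <wchar.h>

/* Read one code point from s starting at index *pos and advance *pos
 * (hypothetical helper). A surrogate pair is combined into one code
 * point; an unpaired surrogate yields U+FFFD, the replacement character. */
uint32_t next_code_point(const wchar_t *s, size_t *pos)
{
    uint32_t u;

    assert(s != NULL && pos != NULL);
    u = (uint32_t)s[(*pos)++];

    if (u >= 0xD800 && u <= 0xDBFF) {            /* high surrogate */
        uint32_t lo = (uint32_t)s[*pos];         /* may be the terminator */
        if (lo >= 0xDC00 && lo <= 0xDFFF) {      /* matching low surrogate */
            (*pos)++;
            return 0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00);
        }
        return 0xFFFD;                           /* high surrogate alone */
    }
    if (u >= 0xDC00 && u <= 0xDFFF)
        return 0xFFFD;                           /* low surrogate alone */
    return u;                                    /* ordinary code point */
}
```

Calling it in a loop until *pos reaches wcslen(s) yields one code point per iteration, regardless of how many wchar_t elements each one occupied.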

Be careful to avoid conversions that translate text to a code page, such as wcstombs(). They are lossy: characters that have no corresponding character code in the code page are replaced by ? or make the conversion fail.

You can use the wcstombs() (wide-character string to multibyte string) function provided in stdlib.h.

The prototype is as follows:

#include <stdlib.h>

size_t wcstombs(char *dest, const wchar_t *src, size_t n);

It converts the wchar_t string pointed to by src into a char (aka octets) string in the current locale's multibyte encoding and writes at most n bytes to dest. It returns the number of bytes written, not counting the terminating null, or (size_t)-1 if a character cannot be converted.

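/* Needs <stdio.h>, <stdlib.h>, <string.h> and <wchar.h>. wcstombs() converts
 * according to the current locale, so call setlocale(LC_ALL, "") from
 * <locale.h> first, or the non-ASCII characters may fail to convert. */
wchar_t wide_string[] = L"Hellöw, Wörld! :)";
char mb_string[512]; /* Might want to calculate a better, more realistic size! */
size_t i, length;

memset(mb_string, 0, sizeof mb_string);
length = wcstombs(mb_string, wide_string, sizeof mb_string - 1);

/* mb_string is zero terminated if the conversion finished before reaching
 * the limit. If the limit WAS reached, wcstombs() does not terminate the
 * string; here the memset above and the limit of 511 keep the last byte zero. */

if (length != (size_t)-1) {   /* (size_t)-1 means a character could not be converted */
    for (i = 0; i < length; i++)
        printf("Octet #%zu: '%02x'\n", i, (unsigned char)mb_string[i]);
}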
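Instead of a fixed 512-byte buffer, the required size can be asked for first. The following is a minimal sketch using the standard wcsrtombs() function, which returns the needed length when its destination is a null pointer. The wrapper name wide_to_mb is invented for this example, and the result is in the current locale's encoding just as with wcstombs().

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Convert src to a freshly allocated multibyte string (hypothetical helper).
 * On success returns the buffer (to be released with free()) and stores the
 * byte count, without the terminator, in *out_len if out_len is not NULL.
 * Returns NULL if a character cannot be converted or allocation fails. */
char *wide_to_mb(const wchar_t *src, size_t *out_len)
{
    mbstate_t state;
    const wchar_t *p = src;
    size_t needed, written;
    char *buf;

    assert(src != NULL);

    /* First pass: destination is NULL, so only the length is computed. */
    memset(&state, 0, sizeof state);
    needed = wcsrtombs(NULL, &p, 0, &state);
    if (needed == (size_t)-1)
        return NULL;

    buf = malloc(needed + 1);               /* +1 for the terminating null */
    if (buf == NULL)
        return NULL;

    /* Second pass: perform the conversion into the exact-size buffer. */
    p = src;
    memset(&state, 0, sizeof state);
    written = wcsrtombs(buf, &p, needed + 1, &state);
    if (written == (size_t)-1) {
        free(buf);
        return NULL;
    }
    if (out_len != NULL)
        *out_len = written;
    return buf;
}
```

The two-pass approach costs a second scan of the string but removes the guesswork about buffer size and the risk of an unterminated result.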

If you're trying to see the content of the memory buffer holding the string, you can do this:

  /* str is the const wchar_t * you want to inspect */
  size_t i, len = wcslen(str) * sizeof(wchar_t);
  const unsigned char *ptr = (const unsigned char *)str;
  for (i = 0; i < len; i++) {
    printf("(%u)", (unsigned)ptr[i]);  /* unsigned char: bytes >= 128 are not printed as negative */
  }
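Note that a raw memory dump like this depends on both sizeof(wchar_t) and the host's byte order. If the octets are meant to be stored or exchanged, a fixed layout is safer. Here is a sketch that writes every wchar_t as four octets, most significant first, whatever the platform. The name wide_to_octets_be is hypothetical, and the four-octet big-endian layout is just one possible choice.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <wchar.h>

/* Store each element of str as four octets, most significant first
 * (hypothetical helper). out must have room for 4 * wcslen(str) bytes.
 * Returns the number of octets written. Where wchar_t is 16 bits wide,
 * the two halves of a surrogate pair are stored as separate elements. */
size_t wide_to_octets_be(const wchar_t *str, unsigned char *out)
{
    size_t i, n = 0;

    assert(str != NULL && out != NULL);
    for (i = 0; str[i] != L'\0'; i++) {
        uint32_t u = (uint32_t)str[i];
        out[n++] = (unsigned char)((u >> 24) & 0xFF);
        out[n++] = (unsigned char)((u >> 16) & 0xFF);
        out[n++] = (unsigned char)((u >> 8) & 0xFF);
        out[n++] = (unsigned char)(u & 0xFF);
    }
    return n;
}
```

Shifting and masking the value, rather than reading its bytes through a pointer, is what makes the output independent of endianness.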

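I don't know exactly why printf and wprintf do not work together (most likely because stdout becomes wide-oriented after the first wprintf call, after which byte-oriented output on the same stream is not allowed). The following code works.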

unsigned int i;
const wchar_t *wchar1 = L"abc";
wprintf(L"%ls\r\n", wchar1);

for(i=0; i<wcslen(wchar1); i++)
{
    /* cast to int so the argument matches %d whatever type wchar_t is */
    wprintf(L"(%d)", (int)(wchar1[i] & 255));
    wprintf(L"(%d)", (int)((wchar1[i] >> 8) & 255));
}
