
How do I print a Unicode character in C encoded with UTF-8?

I am trying to print a magnifying glass ( http://www.fileformat.info/info/unicode/char/1f50e/index.htm ), and I get this error:

[niko@dev1 ncurses]$ gcc -o utf8 -std=c99 $(ncursesw5-config --cflags --libs) utf8.c 
utf8.c: In function ‘main’:
utf8.c:12:10: error: \ud83d is not a valid universal character
   printw("\ud83ddd0e\n");         // escaped Unicode 
          ^
[niko@dev1 ncurses]$ cat utf8.c
#include <locale.h>
#include <curses.h>
#include <stdlib.h>


int main (int argc, char *argv[])
{
  setlocale(LC_ALL, "");

  initscr();

  printw("\ud83ddd0e\n");         // escaped Unicode 

  getch();
  endwin();

  return EXIT_SUCCESS;
}

What is the problem here? For example, if I have the decimal number of the encoding, which for the magnifying glass is 55357, how would I print it with printf to an ncurses screen? (without using wchar_t, because it wastes a lot of memory)

The information on fileformat.info is wrong. The escapes given on that page are \uD83D\uDD0E. This is a UTF-16 surrogate pair as used in Java, but it does not work in C: GCC requires each \u escape to represent one Unicode code point, which half of a surrogate pair is not. (C99 itself does not allow a universal character name in the range D800 through DFFF.)

You should instead use \U (uppercase) with 8 hexadecimal digits, so U+1F50E becomes \U0001F50E. This escaped character is output correctly by printf.
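As a minimal sketch of that point, assuming GCC's default UTF-8 execution character set, the \U escape inside an ordinary narrow string literal expands to the four UTF-8 bytes of U+1F50E, so no wchar_t is involved. The helper names magnifier_utf8 and print_magnifier are made up for this illustration.

```c
#include <stdio.h>

/* Hypothetical helper: returns U+1F50E as a narrow string.
 * Assumes the execution character set is UTF-8 (GCC's default),
 * in which case the literal holds the bytes F0 9F 94 8E. */
const char *magnifier_utf8(void)
{
    return "\U0001F50E";
}

/* Writes the string and a newline to stdout. Whether a glyph
 * appears depends on the terminal and its font. */
void print_magnifier(void)
{
    printf("%s\n", magnifier_utf8());
}
```

In the ncurses program from the question, the same string could be passed as printw("%s\n", magnifier_utf8()) after setlocale(LC_ALL, "") and initscr().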


PS: if instead of a magnifying glass you see something like ~_~T~N, make sure that you have called setlocale and actually linked against -lncursesw; failing to do either means that garbage will be printed instead.
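One way to check the locale half of that advice at run time is sketched below. It assumes a POSIX system where nl_langinfo(CODESET) reports "UTF-8" for UTF-8 locales (other spellings may exist elsewhere); both function names are hypothetical.

```c
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/* Hypothetical helper: true when the codeset name is exactly "UTF-8".
 * Other spellings are not handled in this sketch. */
int codeset_is_utf8(const char *codeset)
{
    return strcmp(codeset, "UTF-8") == 0;
}

/* Adopts the locale from the environment and reports whether it
 * uses UTF-8. Returns 0 if setlocale fails, in which case the
 * default "C" locale stays in effect. */
int locale_is_utf8(void)
{
    if (setlocale(LC_ALL, "") == NULL)
        return 0;
    return codeset_is_utf8(nl_langinfo(CODESET));
}
```

Calling locale_is_utf8() before initscr() and printing a warning when it returns 0 would catch the missing or non-UTF-8 locale case. It says nothing about whether the program was linked against ncursesw.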

You should not encode your string in UTF-16 (the \ud8..\udd.. escapes), but in UTF-8. To convert it, run this command:

perl -e 'print pack("H*","d83ddd0e")' | iconv -f UTF-16BE -t UTF-32BE | hexdump -C

Then you can see that your character is U+0001F50E. To insert this character back into your C code, use the \U escape sequence, with a capital U:

"\U0001F50E"

By the way, your number 55357 (0xD83D) is not the magnifying glass (U+1F50E), but only the first half of the magnifying glass encoded in UTF-16.
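The same conversion can be done in C with the standard UTF-16 surrogate arithmetic. This is only a sketch; the function name surrogates_to_codepoint is invented here.

```c
#include <stdint.h>

/* Combines a UTF-16 high/low surrogate pair into one code point.
 * Returns 0 when the inputs are not a valid pair (a simplification
 * for this sketch; 0 is itself a legal code point). */
uint32_t surrogates_to_codepoint(uint16_t high, uint16_t low)
{
    if (high < 0xD800 || high > 0xDBFF || low < 0xDC00 || low > 0xDFFF)
        return 0;
    /* Each surrogate carries 10 bits; the result is offset by 0x10000. */
    return 0x10000u
         + (((uint32_t)high - 0xD800u) << 10)
         + ((uint32_t)low - 0xDC00u);
}
```

For the pair in the question, surrogates_to_codepoint(0xD83D, 0xDD0E) gives 0x1F50E, which is the value to write after \U (padded to 8 digits).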

Some clarification is needed because the OP asked more than one question:

  • What is the problem here?

    Antti Haapala answered the important part, which dealt with the improperly represented character.

  • For example, if I have a decimal number of encoding, which for magnifying glass is 55357, how would I print it in printf to ncurses screen? (without using wchar_t because it wastes a lot of memory)

    This was unanswered. The comment about wasting memory overlooks the fact that ncurses (i.e., ncursesw) will store all of that information in complex characters, which use even more memory than wide characters (wchar_t).

printw is similar to printf, but not identical. To see this, the printw manual page says:

The printw, wprintw, mvprintw and mvwprintw routines are analogous to printf [see printf(3)]. In effect, the string that would be output by printf is output instead as though waddstr were used on the given window.

To understand what analogous means, a dictionary might help (part of its meaning is "similar", but the two words are not synonymous). But following the link to the waddstr manual page:

These functions write the (null-terminated) character string str on the given window. It is similar to calling waddch once for each character in the string.

Again, "similar" offers no guarantee that the behavior is identical. The waddch manual page gives more information. Among other things, it tells what translations it will do for control and nonprinting characters. Also (and this is the point), waddch in ncurses accepts a multibyte (read: "UTF-8") string and will display it if the locale and terminal support that. That's different from X/Open Curses, as discussed in the Character Set subsection of the manual page's PORTABILITY section.

Those \u escapes tell gcc to pass a UTF-8 string, which happens to work with ncurses. People concerned with standards will equivocate on whether it is guaranteed to work with printf, but let's not get into that swamp.
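For the case where the code point is only known as a number at run time, one option is to encode it to UTF-8 by hand and pass the resulting char buffer along. The sketch below is an illustration rather than anything ncurses provides; utf8_encode is a made-up name.

```c
#include <stddef.h>
#include <stdint.h>

/* Encodes code point cp as UTF-8 into out, which must hold at least
 * 5 bytes, and NUL-terminates it. Returns the number of bytes written
 * before the terminator, or 0 if cp is a surrogate or above U+10FFFF. */
size_t utf8_encode(uint32_t cp, char out[5])
{
    size_t n;

    if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF) {
        out[0] = '\0';
        return 0;
    }
    if (cp < 0x80) {
        out[0] = (char)cp;
        n = 1;
    } else if (cp < 0x800) {
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        n = 2;
    } else if (cp < 0x10000) {
        out[0] = (char)(0xE0 | (cp >> 12));
        out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char)(0x80 | (cp & 0x3F));
        n = 3;
    } else {
        out[0] = (char)(0xF0 | (cp >> 18));
        out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (char)(0x80 | (cp & 0x3F));
        n = 4;
    }
    out[n] = '\0';
    return n;
}
```

After utf8_encode(0x1F50E, buf), the buffer could be handed to printw("%s\n", buf), assuming the locale is set and the program is linked against ncursesw. Passing 55357 returns 0, because that value is a lone surrogate and not a code point.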

There is, by the way, no equivalent of printw which uses wchar_t arrays.

You can use putwchar (see http://www.cplusplus.com/reference/cwchar/putwchar/ ) to print a wchar_t, but I don't believe it works for UTF-16 surrogate pairs.

In any case, printing Unicode text to the terminal is always undefined behavior. On Unix systems, most terminals emulate the VT-100 and are only guaranteed to support 7-bit ASCII text (this is why the isprint function exists).

Your best option is to use a library like freetype2 or cairo+pango to render text to a surface or pixmap in a graphical application.
