
Reading and outputting unicode in C

FILE * f = fopen("filename", "r");
int c;

while((c = fgetc(f)) != EOF) {
    printf("%c\n", c);
}

Hello, I have searched for a whole hour and found many wise dissertations on Unicode, but no answer to this simple question:

What would be the shortest equivalent to these four lines that can handle UTF-8, on Linux using gcc and bash?

Thank you

Something like this should work, given your system:

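#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main(void) {
   /* Select a UTF-8 locale so the wide-character functions decode UTF-8. */
   setlocale(LC_CTYPE, "en_GB.UTF-8");
   FILE * f = fopen("filename", "r");
   if (f == NULL) {
      perror("fopen");
      return 1;
   }
   wint_t c;

   /* fgetwc returns one whole character, however many bytes it occupies. */
   while((c = fgetwc(f)) != WEOF) {
      wprintf(L"%lc\n", c);
   }
   fclose(f);
   return 0;
}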

The problem with your original code is that C doesn't realise (or care) that the characters are multibyte, so each multibyte character is corrupted by the \n printed between its bytes. With this version, the input is decoded as UTF-8, so %lc may now represent several actual bytes (up to 4 in standard UTF-8; the original design allowed up to 6), which are written out together as one sequence. If the input contains ASCII, it simply uses one byte per character as before (since ASCII is compatible with UTF-8).
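
To make the byte counts concrete, here is a minimal sketch of how the first byte of a UTF-8 sequence determines its length. The helper name utf8_seq_len is hypothetical (it is not part of the C library), and the sketch only classifies lead bytes; it does not validate the continuation bytes that follow.

```c
#include <stddef.h>

/* Hypothetical helper (not a libc function): returns the number of bytes in
   the UTF-8 sequence that starts with lead byte b, or 0 if b cannot start a
   sequence (a continuation byte or an invalid lead byte). */
size_t utf8_seq_len(unsigned char b)
{
    if (b < 0x80)
        return 1;                      /* 0xxxxxxx: plain ASCII */
    if (b >= 0xC2 && b <= 0xDF)
        return 2;                      /* 110xxxxx, excluding overlong C0/C1 */
    if (b >= 0xE0 && b <= 0xEF)
        return 3;                      /* 1110xxxx */
    if (b >= 0xF0 && b <= 0xF4)
        return 4;                      /* 11110xxx, up to U+10FFFF */
    return 0;                          /* 10xxxxxx continuation or invalid */
}
```

For £ (bytes \302\243, i.e. 0xC2 0xA3), the lead byte 0xC2 gives a length of 2, while 0xA3 on its own is a continuation byte, which is why printing it separately produces garbage.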

strace is always useful for debugging things like this. As an example, if the file contains just ££ (£ has the UTF-8 sequence \302\243), your version produces:

write(1, "\302\n\243\n\302\n\243\n\n\n", 10) = 10

And mine:

write(1, "\302\243\n\302\243\n", 6)     = 6

Note that once you read from or write to a stream (including stdout) it is set to either byte or wide orientation, and you will need to re-open the stream if you want to change it. So, for example, if you wanted to read the UTF-8 file but leave stdout byte-oriented, you could replace the wprintf with:

  printf("%lc\n", c);

This involves extra work in the background (to convert the formats), but provides better compatibility with other code that expects a byte stream.
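
As a rough illustration of that background conversion: the C standard describes %lc in printf as behaving as if the wide character were converted with wcrtomb. The sketch below does that conversion by hand. The helper name wide_to_bytes is hypothetical, and the result depends on the current LC_CTYPE locale; it also assumes, as on Linux with glibc, that wchar_t values are Unicode code points.

```c
#include <limits.h>   /* MB_LEN_MAX */
#include <string.h>   /* memset */
#include <wchar.h>    /* wcrtomb, mbstate_t */

/* Hypothetical helper: converts one wide character to its multibyte form in
   the current LC_CTYPE locale. out must have room for MB_LEN_MAX bytes.
   Returns the number of bytes written, or 0 if wc cannot be represented
   in the current locale. */
size_t wide_to_bytes(wchar_t wc, char *out)
{
    mbstate_t state;
    memset(&state, 0, sizeof state);       /* start in the initial state */

    size_t n = wcrtomb(out, wc, &state);
    return n == (size_t)-1 ? 0 : n;        /* (size_t)-1 signals an error */
}
```

With a UTF-8 locale selected via setlocale, passing the wide character for £ should yield the two bytes \302\243; in the default "C" locale the same call is expected to fail, which is one reason the setlocale call in the program above matters.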

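The orientation rule can be checked with the standard fwide function, which reports a stream's orientation and can only set it while none has been chosen yet. The following is a small sketch under that assumption; the helper name orientation_after_first_write is hypothetical, and it uses a temporary file so that stdout is left alone.

```c
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: does one write on a fresh temporary stream (wide if
   wide_first is non-zero, byte otherwise), then asks fwide for the opposite
   orientation. Returns <0 if the stream is byte-oriented, >0 if it is
   wide-oriented, and 0 if the temporary file could not be created. */
int orientation_after_first_write(int wide_first)
{
    FILE *fp = tmpfile();
    if (fp == NULL)
        return 0;

    /* The first I/O operation fixes the stream's orientation. */
    if (wide_first)
        fputwc(L'x', fp);
    else
        fputc('x', fp);

    /* Requesting the other orientation has no effect once one is set;
       fwide just reports the orientation the stream already has. */
    int result = fwide(fp, wide_first ? -1 : 1);

    fclose(fp);
    return result;
}
```

Calling it with 0 should give a negative value (byte-oriented) and with 1 a positive value (wide-oriented), matching the point that mixing printf and wprintf on the same stream does not work without re-opening it.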