Reading and outputting Unicode in C

Question

FILE * f = fopen("filename", "r");
int c;

while((c = fgetc(f)) != EOF) {
    printf("%c\n", c);
}

Hello, I have searched for a whole hour and found many wise dissertations on Unicode, but no answer to this simple question:

What would be the shortest equivalent to these four lines that can handle UTF-8, on Linux using gcc and bash?

Thank you

Answer 1

Something like this should work, given your system:

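#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main(void) {
   /* Select a UTF-8 locale so the wide-character functions decode UTF-8. */
   setlocale(LC_CTYPE, "en_GB.UTF-8");
   FILE * f = fopen("filename", "r");
   if (f == NULL) {
      perror("filename");
      return 1;
   }
   wint_t c;

   /* fgetwc() returns one whole character, however many bytes it occupies. */
   while((c = fgetwc(f)) != WEOF) {
      wprintf(L"%lc\n", c);
   }
   fclose(f);
   return 0;
}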

The problem with your original code is that C doesn't realise (or care) that the characters are multibyte, so fgetc returns one byte at a time and each multibyte character is corrupted by the \n printed between its bytes. With this version, the input is decoded as UTF-8, so %lc may now represent several actual bytes (up to 4 in modern UTF-8; the original design allowed up to 6), which are written out together as one sequence. If the input has any ASCII, it'll simply use one byte per character as before (since ASCII is compatible with UTF-8).
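To make the "multibyte" point concrete, here is a minimal sketch of how the first byte of a UTF-8 sequence announces the length of that sequence. The helper name utf8_seq_len is hypothetical (it is not part of the answer or of the C library), and it only inspects the lead byte; it does not validate the bytes that follow.

```c
/* Hypothetical helper: number of bytes in the UTF-8 sequence that starts
   with `lead`, or 0 if `lead` is a continuation byte (10xxxxxx) or is not
   a lead byte at all. The following bytes are not checked. */
int utf8_seq_len(unsigned char lead) {
    if (lead < 0x80)           return 1;  /* 0xxxxxxx: ASCII */
    if ((lead & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 0;                             /* continuation or invalid */
}
```

For £ (\302\243, i.e. 0xC2 0xA3), the first byte reports a length of 2 and the second is a continuation byte, which is why a \n placed between them breaks the character.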

strace is always useful for debugging things like this. As an example, suppose the file contains just ££ (£ has the UTF-8 sequence \302\243). Your version produces:

write(1, "\302\n\243\n\302\n\243\n\n\n", 10) = 10

And mine:

write(1, "\302\243\n\302\243\n", 6)     = 6
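The second trace keeps each £ as an intact \302\243 pair. If that is all you need, the same effect can probably be had without wide-character I/O, by copying bytes and starting a new line only at character boundaries. This is a sketch of an alternative, not the approach used in this answer: the function name print_chars_per_line is hypothetical, and it assumes the input is well-formed UTF-8.

```c
#include <stdio.h>

/* Hypothetical helper: copy `in` to `out`, ending each UTF-8 character
   with '\n'. Assumes well-formed UTF-8; continuation bytes (10xxxxxx)
   stay attached to the lead byte before them. */
void print_chars_per_line(FILE *in, FILE *out) {
    int c;
    int wrote_any = 0;

    while ((c = fgetc(in)) != EOF) {
        int is_continuation = (c & 0xC0) == 0x80;
        if (wrote_any && !is_continuation) {
            fputc('\n', out);   /* previous character is complete */
        }
        fputc(c, out);
        wrote_any = 1;
    }
    if (wrote_any) {
        fputc('\n', out);       /* terminate the last character */
    }
}
```

Calling print_chars_per_line(f, stdout) after a successful fopen would replace the loop in the question. It needs no setlocale call, but unlike fgetwc it never gives you the decoded code point.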

Note that once you read from or write to a stream (including stdout) it is set to either byte or wide orientation, and you will need to re-open the stream if you want to change it. So, for example, if you wanted to read the UTF-8 file but leave stdout byte-oriented, you could replace the wprintf with:

  printf("%lc\n", c);

This involves extra code in the background (to convert the formats), but provides better compatibility with other code that expects a byte stream.
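If you are unsure which orientation a stream has ended up with, the standard fwide function from <wchar.h> can query it when called with a mode of 0. The wrapper below is only a sketch, and the name stream_orientation is hypothetical:

```c
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: report a stream's orientation without changing it.
   Returns 1 for wide, -1 for byte, 0 if no orientation has been set yet. */
int stream_orientation(FILE *stream) {
    int r = fwide(stream, 0);   /* mode 0 only queries */
    return (r > 0) - (r < 0);   /* normalise to -1, 0 or 1 */
}
```

A freshly opened stream should report 0; after the first fgetc or printf it reports -1, and after the first fgetwc or wprintf it reports 1.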

Answer 1
6 · Accepted · 2013-03-16 17:15:42
