
Handling special characters in C (UTF-8 encoding)

I'm writing a small application in C that reads a simple text file and then outputs its lines one by one. The problem is that the text file contains special characters such as Æ, Ø and Å, among others. When I run the program in a terminal, those characters are printed as "?".

Is there an easy fix?

First things first:

  1. Read the file into a buffer.
  2. Use libiconv or similar to convert the UTF-8 data to wchar_t, and use the wide-character handling functions such as wprintf().
  3. Use the wide-character functions in C! Most file/output handling functions have a wide-character variant.
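In practice libiconv (or the C library's own multibyte functions) would perform the conversion in step 2. To make clear what that conversion amounts to, here is a minimal hand-written sketch that decodes one UTF-8 sequence into a Unicode code point. The function name utf8_decode and its interface are hypothetical, not part of any library.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one UTF-8 sequence starting at s, with at
 * most len bytes available. Stores the code point in *cp and returns the
 * number of bytes consumed, or 0 if the input is truncated or malformed. */
size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *cp)
{
    /* Smallest code point that legitimately needs 2, 3 or 4 bytes. */
    static const uint32_t min_value[5] = {0, 0, 0x80, 0x800, 0x10000};
    size_t need, i;
    uint32_t value;

    if (len == 0)
        return 0;

    if (s[0] < 0x80) {                    /* 0xxxxxxx: plain ASCII */
        *cp = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {   /* 110xxxxx: 2-byte sequence */
        need = 2;
        value = s[0] & 0x1F;
    } else if ((s[0] & 0xF0) == 0xE0) {   /* 1110xxxx: 3-byte sequence */
        need = 3;
        value = s[0] & 0x0F;
    } else if ((s[0] & 0xF8) == 0xF0) {   /* 11110xxx: 4-byte sequence */
        need = 4;
        value = s[0] & 0x07;
    } else {
        return 0;                         /* stray continuation or invalid lead byte */
    }

    if (len < need)
        return 0;                         /* sequence cut off by end of buffer */

    for (i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80)        /* continuation bytes are 10xxxxxx */
            return 0;
        value = (value << 6) | (s[i] & 0x3F);
    }

    /* Reject overlong forms, UTF-16 surrogates and out-of-range values. */
    if (value < min_value[need] || value > 0x10FFFF ||
        (value >= 0xD800 && value <= 0xDFFF))
        return 0;

    *cp = value;
    return need;
}
```

Calling it in a loop over the buffer, advancing by the returned byte count each time, yields the sequence of code points. For example, the two bytes C3 86 decode to U+00C6 (Æ).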

Ensure that your terminal can handle UTF-8 output. Having the correct locale set up, and manipulating the locale data, can automate a lot of the file opening and conversion for you, depending on what you are doing.

Remember that the width of a code point (character) in UTF-8 is variable. This means you can't just seek to an arbitrary byte and begin reading as you would with ASCII, because you might land in the middle of a code point. Good libraries can handle this in some cases.
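One way to recover after landing in the middle of a code point relies on the fact that UTF-8 continuation bytes always have the bit pattern 10xxxxxx, so you can step backwards until you reach a byte that is not a continuation byte. The following is a minimal sketch; the function name utf8_sync_back is hypothetical, and it assumes the buffer holds well-formed UTF-8.

```c
#include <stddef.h>

/* Hypothetical helper: given a byte offset pos inside buf (pos must be a
 * valid index), return the offset of the first byte of the code point
 * that contains it. Assumes buf holds well-formed UTF-8. */
size_t utf8_sync_back(const char *buf, size_t pos)
{
    /* Continuation bytes look like 10xxxxxx; skip back over them. */
    while (pos > 0 && ((unsigned char)buf[pos] & 0xC0) == 0x80)
        pos--;
    return pos;
}
```

With the bytes 61 C3 86 62 ("aÆb"), offset 2 points at the second byte of Æ, and the function moves it back to offset 1, where the character starts.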

Here is some code (not mine) that demonstrates UTF-8 file reading and wide-character handling in C.

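#include <stdio.h>
#include <wchar.h>

int main(void)
{
    /* The "ccs=UTF-8" mode flag is a C library extension, not standard C;
       check your platform's fopen documentation for the exact spelling. */
    FILE *f = fopen("data.txt", "r, ccs=UTF-8");
    if (!f)
        return 1;

    /* Read one wide character at a time and print its code point in hex. */
    for (wint_t c; (c = fgetwc(f)) != WEOF;)
        printf("%04X\n", (unsigned int)c);

    fclose(f);
    return 0;
}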

Links

  1. libiconv
  2. Locale data in C/GNU libc
  3. Some handy info
  4. Another good Unicode/UTF-8 in C resource

Make sure you're not accidentally dropping any bytes; some UTF-8 characters are more than one byte long (that's sort of the point), and you need to keep them all.

It can be useful to print the contents of the buffer as hex, so you can inspect which bytes were actually read:

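#include <stdio.h>

/* Print each byte of the buffer as two hex digits. */
static void print_buffer(const char *buffer, size_t length)
{
  size_t i;

  for(i = 0; i < length; i++)
    /* Go through unsigned char so bytes >= 0x80 are not sign-extended. */
    printf("%02x ", (unsigned int) (unsigned char) buffer[i]);
  putchar('\n');
}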

You can do this after loading a very short file containing just a few characters.
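As a rough sketch of that check, the helpers below load the raw bytes of a file in binary mode and dump them as hex. The names load_bytes and dump_hex are made up for illustration. If a file containing only "Æ" dumps as c3 86, it is UTF-8; if it dumps as a single c6, it is more likely ISO-8859-1.

```c
#include <stdio.h>

/* Hypothetical helper: read up to cap bytes from path into buf, using
 * binary mode so the C library does not translate anything.
 * Returns the number of bytes read, or 0 if the file cannot be opened. */
size_t load_bytes(const char *path, char *buf, size_t cap)
{
    FILE *f = fopen(path, "rb");
    size_t n;

    if (!f)
        return 0;
    n = fread(buf, 1, cap, f);
    fclose(f);
    return n;
}

/* Hypothetical helper: print n bytes of buf as two-digit hex values. */
void dump_hex(const char *buf, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        printf("%02x ", (unsigned int)(unsigned char)buf[i]);
    putchar('\n');
}
```

A typical use would be to call load_bytes() on the short test file with a small stack buffer and pass the result straight to dump_hex().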

Also make sure the terminal is set to the proper encoding, so that it interprets your characters as UTF-8.

Probably your text file is ISO-8859-1 encoded but your terminal is UTF-8. This kind of mismatch is a standard problem with byte-oriented text handling; other C programs (such as the standard 'cat' and 'more' commands) will do the same thing, and it isn't generally considered an error or something that needs to be fixed.
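If the file really is ISO-8859-1 (an assumption you would need to verify), converting it for a UTF-8 terminal is straightforward, because every ISO-8859-1 byte value equals its Unicode code point. Bytes below 0x80 pass through unchanged, and the rest become two-byte UTF-8 sequences. Here is a minimal sketch; the function name latin1_to_utf8 is hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper: convert len bytes of ISO-8859-1 text to UTF-8.
 * out must have room for 2 * len + 1 bytes. The result is NUL-terminated;
 * the return value is the number of bytes written, excluding the NUL. */
size_t latin1_to_utf8(const unsigned char *in, size_t len, char *out)
{
    size_t i, n = 0;

    for (i = 0; i < len; i++) {
        unsigned char c = in[i];

        if (c < 0x80) {
            out[n++] = (char)c;                   /* ASCII: copied as-is */
        } else {
            out[n++] = (char)(0xC0 | (c >> 6));   /* lead byte 110xxxxx */
            out[n++] = (char)(0x80 | (c & 0x3F)); /* continuation 10xxxxxx */
        }
    }
    out[n] = '\0';
    return n;
}
```

For example, the ISO-8859-1 byte C6 (Æ) becomes the UTF-8 pair C3 86.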

If you want to operate at the Unicode character level instead of on bytes, that's fine, but you'll need to use wchar_t as your character type instead of char throughout your program, and provide switches for the user to specify what the incoming file encoding actually is. (Whilst it is sometimes possible to guess, it's not very reliable.)

I don't know if it will help, but if you're sure that the terminal and the input file use the same encoding, you can try calling setlocale():

#include <locale.h>
…
setlocale(LC_CTYPE, "");
