Explanation needed: creating a UTF-8 encoded file on Linux with C++

Question

I need some explanation of file encoding when using g++ on Linux.

I have a simple program:

int main ()
{
  FILE * pFile;
  char buffer[] = { 'x' , 'y' , 'z' ,'é' };
  pFile = fopen ("myfile", "wt, ccs=UTF-8");
  //pFile = fopen ("myfile", "wt");
  fwrite (buffer , sizeof(char), sizeof(buffer), pFile);
  fclose (pFile);
  return 0;
}

Even with "ccs=UTF-8" added to the fopen call, the output file of this program is always encoded in ISO-8859-1. However, if I create a file containing these characters with vi on Linux, the resulting file is UTF-8 encoded (I use the command "file myfile" to check the file's encoding, and "xxd -b myfile" confirms this).

So I would like to understand:

1- Why doesn't g++ on Linux create a UTF-8 file by default?

2- What is the purpose of ccs=UTF-8 if the created file is not encoded in UTF-8?

3- How can I create a UTF-8 file based on this simple code?

Thanks.

Answer 1

Your file may appear to be in ISO-8859-1, but it actually isn't. It's simply broken.

Your file contains the byte A9, which is the second (low-order) byte of the UTF-8 representation of é (C3 A9).
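To see where A9 comes from, here is a minimal sketch of the two-byte UTF-8 encoding rule. The helper name `encode_utf8_small` is made up for illustration, and it deliberately handles only code points below U+0800, which is enough for é (U+00E9).

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: encodes a code point below U+0800 as UTF-8.
// Larger code points are out of scope for this sketch and give an empty result.
std::vector<std::uint8_t> encode_utf8_small(std::uint32_t cp) {
    if (cp < 0x80) {
        // ASCII range: one byte, unchanged.
        return { static_cast<std::uint8_t>(cp) };
    }
    if (cp < 0x800) {
        // Two bytes: 110xxxxx 10xxxxxx
        return {
            static_cast<std::uint8_t>(0xC0 | (cp >> 6)),
            static_cast<std::uint8_t>(0x80 | (cp & 0x3F))
        };
    }
    return {};
}
```

Calling `encode_utf8_small(0xE9)` gives the two bytes C3 and A9, so the A9 in the file is only the tail of that sequence.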

When you wrote 'é', the compiler should have warned you:

 aaa.c:4:38: warning: multi-character character constant [-Wmultichar]
     char buffer[] = { 'x' , 'y' , 'z' ,'é' };
                                         ^

char is not a type for a character; it is a type for one byte. GCC treats multibyte character literals as big-endian integers. Here, you convert it immediately to char, which keeps only the lowest byte: A9.

(BTW, é in ISO-8859-1 is E9, not A9.)
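The truncation can be reproduced with plain integer arithmetic. This is only a sketch: `pack_multichar` and `truncate_to_byte` are hypothetical helpers that imitate the big-endian packing described above, assuming the source file is saved as UTF-8 so that é is the byte pair C3 A9. They do not call into the compiler's actual handling of multi-character constants.

```cpp
#include <cstdint>

// Hypothetical helper: packs two bytes with the first byte in the high
// position, imitating a two-byte multi-character constant.
constexpr int pack_multichar(std::uint8_t first, std::uint8_t second) {
    return (first << 8) | second;
}

// Keeps only the lowest byte, which is what survives when the int value
// is stored in a single byte.
constexpr std::uint8_t truncate_to_byte(int value) {
    return static_cast<std::uint8_t>(value & 0xFF);
}
```

With these, `pack_multichar(0xC3, 0xA9)` is 0xC3A9 and `truncate_to_byte(0xC3A9)` is 0xA9, the byte that ended up in the file.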

You open your file with an encoding, but then you write raw bytes into it. Those bytes correspond to the ISO-8859-1 characters xyz©.

If you want to write characters, not bytes, then use wchar_t instead of char and fputws instead of fwrite:

#include <stdio.h>
#include <wchar.h>

int main ()
{
  FILE * pFile;
  // note final zero and L indicating wchar_t literal
  wchar_t buffer[] = { 'x' , 'y' , 'z' , L'é' , 0};
  // note no space before ccs
  pFile = fopen ("myfile", "wt,ccs=UTF-8");
  fputws(buffer, pFile);
  fclose (pFile);
  return 0;
}
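A different route, not the one shown above, is to stay with bytes and write the UTF-8 sequence yourself. The sketch below assumes that approach is acceptable for your use case; the function name `write_xyz_e_acute_utf8` is made up, and the file is opened in plain binary mode without any ccs option.

```cpp
#include <cstddef>
#include <cstdio>

// Hypothetical helper: writes "xyzé" to `path` as raw UTF-8 bytes.
// Returns true when every byte was written and the file closed cleanly.
bool write_xyz_e_acute_utf8(const char* path) {
    // é (U+00E9) is the two-byte sequence C3 A9 in UTF-8.
    static const unsigned char bytes[] = { 'x', 'y', 'z', 0xC3, 0xA9 };

    std::FILE* file = std::fopen(path, "wb");
    if (file == nullptr) {
        return false;
    }
    const std::size_t written = std::fwrite(bytes, 1, sizeof bytes, file);
    const bool closed = (std::fclose(file) == 0);
    return written == sizeof bytes && closed;
}
```

After calling `write_xyz_e_acute_utf8("myfile")`, checking the result with "file myfile" or "xxd myfile" as in the question should show the five bytes 78 79 7a c3 a9.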

Answer 1: accepted, 2014-12-05 14:35:20
