Linux & C-Programming: How can I write utf-8 encoded text to a file?

I am interested in writing UTF-8 encoded strings to a file.

I did this with the low-level functions open() and write(). First I set the locale to a UTF-8-aware character set with setlocale("LC_ALL", "de_DE.utf8"). But the resulting file does not contain UTF-8 characters, only ISO-8859 encoded umlauts. What am I doing wrong?

Addendum: I don't know if my strings are really UTF-8 encoded in the first place. I just keep them in the source file in this form: char *msg = "Rote Grütze";

See the screenshot for the content of the text file: http://img19.imageshack.us/img19/9791/picture1jh9.png

Changing the locale won't change the actual data written to the file by write(). You have to actually produce UTF-8 encoded bytes to write them to a file. For that purpose you can use a library such as ICU.

Edit after your edit of the question: UTF-8 differs from ISO-8859 only in the "special" symbols (ümlauts, áccénts, etc.), that is, in every character outside the 7-bit ASCII range. So, for all text that doesn't contain any of these symbols, both encodings produce identical bytes. However, if you include strings with those symbols in your program, you have to make sure your text editor treats and saves the data as UTF-8. Sometimes you just have to tell it to.

To sum up, the text you produce will be in UTF-8 if the strings within the source code are in UTF-8.

Another edit: Just to be sure, you can convert your source code to UTF-8 using iconv:

iconv -f latin1 -t utf8 file.c

This will convert all your latin-1 strings to UTF-8, and when you print them they will definitely be in UTF-8. Note that iconv writes the converted text to standard output, so redirect it to a new file. If iconv complains about a character, or the output strings contain garbled characters (such as "Ã¼" where "ü" should be), then your strings were most likely in UTF-8 already.

Regards,

Yes, you can do it with glibc. It calls this multibyte rather than UTF-8, because it can handle more than one encoding. Check out this part of the manual.

Look for functions with the mb prefix, and also functions with the wc prefix, for converting between multibyte and wide characters. You'll have to set the locale to a UTF-8 one with setlocale() first, so that glibc chooses the UTF-8 implementation of multibyte support.

If you are coming from a Unicode file, I believe the function you are looking for is wcstombs().

Can you open the file in a hex editor and verify, with a simple input example, that the written bytes are not the values of the Unicode characters that you passed to write()? Sometimes a text editor has no way to determine the character set, and yours may have assumed ISO-8859-1.

Once you have done this, could you edit your original post to add the pertinent information?
