
Read and write a UTF-8 string to a file

If I have a UTF-8 encoded string in C (basically a char, or unsigned char?, array) and I want to write it to and read it from a file (say, in binary mode), is there anything I need to do differently compared to writing or reading plain ASCII characters?

Short answer: No, nothing different.
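To illustrate the short answer, here is a minimal sketch of a binary-mode round trip. The helper names `write_utf8_file` and `read_utf8_file` are made up for this example; the only point is that the UTF-8 bytes go through `fwrite` and `fread` untouched, exactly as ASCII bytes would.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write the bytes of a NUL-terminated UTF-8 string
 * to `path` in binary mode. Returns 0 on success, -1 on failure. */
int write_utf8_file(const char *path, const char *text)
{
    FILE *fp = fopen(path, "wb");
    if (fp == NULL)
        return -1;

    size_t len = strlen(text);              /* number of bytes, not characters */
    size_t written = fwrite(text, 1, len, fp);
    int close_failed = fclose(fp);

    return (written == len && close_failed == 0) ? 0 : -1;
}

/* Hypothetical helper: read at most cap - 1 bytes from `path` into buf
 * and NUL-terminate the result. Returns the byte count, or -1 on failure. */
long read_utf8_file(const char *path, char *buf, size_t cap)
{
    if (cap == 0)
        return -1;

    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return -1;

    size_t n = fread(buf, 1, cap - 1, fp);
    int read_failed = ferror(fp);
    fclose(fp);
    if (read_failed)
        return -1;

    buf[n] = '\0';
    return (long)n;
}
```

Note that the lengths involved are byte counts. If the buffer is too small, the read may stop in the middle of a multi-byte character, so size it for the whole file if you intend to process the text afterwards.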

Longer answer: As always, it depends.

It depends on what you're going to use to read the file afterwards. If it's some other application, you may need to give it a hint that the file is UTF-8 encoded text by putting a UTF-8 BOM at the front. However, this is typically discouraged, so you can usually revert to the short answer!
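If you do end up needing a BOM, it is just the three bytes EF BB BF at the start of the file. The sketch below shows one way to write it and to skip it when reading; `write_utf8_bom` and `utf8_bom_length` are illustrative names, not part of any standard library.

```c
#include <stdio.h>
#include <string.h>

/* The UTF-8 byte order mark: the three bytes EF BB BF. */
static const unsigned char UTF8_BOM[3] = { 0xEF, 0xBB, 0xBF };

/* Hypothetical helper: write the BOM at the current position of fp.
 * Returns 0 on success, -1 on failure. */
int write_utf8_bom(FILE *fp)
{
    size_t written = fwrite(UTF8_BOM, 1, sizeof UTF8_BOM, fp);
    return written == sizeof UTF8_BOM ? 0 : -1;
}

/* Hypothetical helper: return how many leading bytes of buf to skip,
 * i.e. 3 if the buffer starts with a UTF-8 BOM and 0 otherwise. */
size_t utf8_bom_length(const char *buf, size_t len)
{
    if (len >= sizeof UTF8_BOM && memcmp(buf, UTF8_BOM, sizeof UTF8_BOM) == 0)
        return sizeof UTF8_BOM;
    return 0;
}
```

A reader would call `utf8_bom_length` on the bytes it has loaded and start processing at that offset, so files with and without a BOM are handled the same way.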

However, your comments imply you're interested in processing the char array rather than simply reading or writing it. In that case, yes, you may need to do things differently, depending entirely on what you want to do. Because a single Unicode character can be encoded as multiple bytes in the array, for some operations (counting word lengths in text, for instance) you need to be aware of the multi-byte characters. But because every byte of a multi-byte sequence in UTF-8 has the high bit set, you're never going to get them mixed up with normal ASCII characters. So things like string search and replace typically work just as they do for plain ASCII.
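As a rough sketch of what "being aware of multi-byte characters" can look like: continuation bytes in UTF-8 always have the bit pattern 10xxxxxx, so counting every byte that is not a continuation byte gives the number of code points. The function name `utf8_code_points` is made up here, and the code assumes the input is already valid UTF-8.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count the code points in a NUL-terminated string,
 * assuming it is valid UTF-8. Continuation bytes (10xxxxxx) are skipped,
 * so each character is counted once, at its first byte. */
size_t utf8_code_points(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; ++s) {
        /* Cast first: plain char may be signed on this platform. */
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

For the string "čaj" (bytes C4 8D 61 6A), `strlen` reports 4 while `utf8_code_points` reports 3. Byte-oriented functions such as `strstr` still find a UTF-8 needle in a UTF-8 haystack without any changes. Keep in mind that code points are not always what a user perceives as one character (combining marks, for example), so this is only a first approximation.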

If you're just outputting it (no character counting or modifications), you shouldn't have to worry about it. On Linux with gcc, you can even put UTF-8 directly inside string literals in your source, and it works fine.

E.g.:

 puts("řčšéíčšřáčéířáéíščřáéíčřáščéřáěéířěéčšě"); //Will work correctly on Linux

It's just that č, for example, won't be represented by a single char.

As long as you don't need to do arithmetic on the individual characters, you should be fine.

UTF-8 expects at least 8 bits per character "unit", and C chars, signed or not, are guaranteed to have these. Nothing is different, except, of course, when you have a habit of adding "a" to "b" (a nonsense operation on text) or converting to and from integers. The latter is as okay as it is with "regular" ASCII text that contains occasional high-ASCII characters: if you take care of the conversions where they may happen, you should be fine.
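The conversion that most often needs care is char to integer, because plain char may be signed and every byte of a multi-byte UTF-8 sequence is 0x80 or above. A small sketch, with illustrative helper names `byte_value` and `is_ascii_byte`:

```c
/* Hypothetical helper: return the byte value 0..255 of c, regardless of
 * whether plain char is signed on this platform. */
unsigned int byte_value(char c)
{
    return (unsigned char)c;
}

/* Hypothetical helper: nonzero if c is a plain ASCII byte, zero if it
 * belongs to a multi-byte UTF-8 sequence (high bit set). */
int is_ascii_byte(char c)
{
    return (unsigned char)c < 0x80;
}
```

The same cast matters when calling the `<ctype.h>` functions: passing a negative char value (other than EOF) to something like `isalpha` is undefined behaviour, so it is safer to write `isalpha((unsigned char)c)`.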

With that out of the way: if you are planning to display your output, you might want to use the same type (signed or unsigned) as your output library.

If I have to output UTF-8 to the console (OS X's Terminal window, which is fully capable of showing UTF-8), I use regular char strings, so I can use the standard stdlib and string functions.


 