
Reading a Unicode file in C and passing contents as ASCII via sockets

I've been trying to figure this out, but nothing seems to work. We have an application that reads thousands of transaction files using the normal "fopen, fgets, etc.", which we parse using the normal C string functions "strstr, strchr, etc." and return as a normalized char *.

However, now we need to read some files that are in Unicode (from Windows), and I am having a lot of trouble. In the code I am working on, I only receive a file pointer (FP) without knowing whether it points to a normal ASCII file or a Unicode one, and I need to send the contents back to the application as a char *.

I also cannot run command-line tools to convert the whole file manually, because we are tailing it for new entries.

I tried using WideCharToMultiByte and mbsrtowcs, but it seems that after I read the file using fgets and pass the buffer to them, the result is always empty (0 bytes). Does anyone have an example of how to do this properly? The online docs/manuals for these functions all lack good examples.

Thanks!

I don't have the full answer, but part of the problem is determining the character encoding. Unicode files created on Windows will normally start with a byte-order mark (BOM), the Unicode character U+FEFF. If one is found, it can be used to determine the encoding.
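As a sketch of that check (the helper name is mine, and it assumes the stream was just opened and is seekable):

```c
#include <stdio.h>

/* Sketch: sniff a freshly opened, seekable stream for a BOM, leaving it
   positioned just past the BOM if one is found. Returns NULL if no BOM. */
static const char *detect_bom(FILE *fp) {
    unsigned char b[4] = {0};
    size_t n = fread(b, 1, 4, fp);

    if (n >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF) {
        fseek(fp, 3, SEEK_SET); return "UTF-8";
    }
    /* check the 4-byte UTF-32 marks before the 2-byte UTF-16 ones,
       because UTF-32LE starts with the same FF FE pair */
    if (n >= 4 && b[0] == 0xFF && b[1] == 0xFE && b[2] == 0 && b[3] == 0) {
        fseek(fp, 4, SEEK_SET); return "UTF-32LE";
    }
    if (n >= 4 && b[0] == 0 && b[1] == 0 && b[2] == 0xFE && b[3] == 0xFF) {
        fseek(fp, 4, SEEK_SET); return "UTF-32BE";
    }
    if (n >= 2 && b[0] == 0xFF && b[1] == 0xFE) {
        fseek(fp, 2, SEEK_SET); return "UTF-16LE";
    }
    if (n >= 2 && b[0] == 0xFE && b[1] == 0xFF) {
        fseek(fp, 2, SEEK_SET); return "UTF-16BE";
    }
    fseek(fp, 0, SEEK_SET);        /* no BOM: encoding unknown */
    return NULL;
}
```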

If you have a string encoded in, say, UTF-16, it can contain any number of embedded NUL bytes, so you cannot use the normal ASCII versions of the string functions (strlen and so on): they will treat the first NUL byte as the end-of-string marker. Your standard library has wide-character versions of these functions that you should use instead.
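A tiny illustration of why the narrow functions fail: ASCII letters in UTF-16LE carry a zero byte after every letter, so strlen() gives up immediately, while the wide-character wcslen() counts whole characters.

```c
#include <stdio.h>
#include <string.h>
#include <wchar.h>

int main(void) {
    /* "hi" as raw UTF-16LE bytes: each ASCII letter is padded with a zero */
    const char utf16le[] = { 'h', 0, 'i', 0, 0, 0 };

    /* strlen() treats the first zero byte as the end of the string */
    printf("strlen: %zu\n", strlen(utf16le));   /* prints 1 */

    /* the wide-character version counts characters, not bytes */
    printf("wcslen: %zu\n", wcslen(L"hi"));     /* prints 2 */
    return 0;
}
```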

That's one of the problems with character encodings: you either have to assume the data is in some particular encoding, get that information from inside the data or from metadata, or detect it yourself.

On Windows, it's common to use a byte-order mark at the beginning of a file, but this violates many practices and breaks a lot of things, so it's not common in the Unix world.

There are libraries devoted to exactly this: Unicode and character encodings. The most popular are iconv and ICU.
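For example, a minimal sketch using POSIX iconv, assuming the input is already known to be UTF-16LE (the function name and the caller-supplied buffers are illustrative):

```c
#include <iconv.h>
#include <stddef.h>

/* Minimal sketch: convert one UTF-16LE buffer to UTF-8 with POSIX iconv.
   For plain ASCII text the UTF-8 output is byte-identical to ASCII.
   Returns the number of bytes written, or -1 on error. */
int utf16le_to_utf8(const char *in, size_t inlen, char *out, size_t outlen) {
    iconv_t cd = iconv_open("UTF-8", "UTF-16LE");
    if (cd == (iconv_t)-1)
        return -1;

    char *inp  = (char *)in;   /* iconv wants non-const pointers */
    char *outp = out;
    size_t rc  = iconv(cd, &inp, &inlen, &outp, &outlen);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return -1;

    *outp = '\0';              /* NUL-terminate the result */
    return (int)(outp - out);
}
```

Converting straight to "ASCII" instead of "UTF-8" also works with iconv, but the call will then fail on any character that ASCII cannot represent.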

A few points:

If you can be sure that the UNICODE files have a byte-order mark (BOM), you can look out for that. However, UNICODE files are not required to have a BOM, so it depends on where they come from.

If the file is UNICODE, you cannot read it with fgets(); you need to use fgetws() or fread(). UNICODE characters may contain zero bytes (bytes with a value of zero), which will confuse fgets().
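For instance, a sketch using the Microsoft CRT's ccs= mode flag (a Microsoft extension, not portable C; the file name is illustrative), which makes fgetws() decode UTF-16 lines directly:

```c
#include <stdio.h>
#include <wchar.h>

int main(void) {
    /* the ",ccs=UTF-16LE" part tells the Microsoft CRT to decode the
       file as UTF-16LE, so fgetws() returns whole wide-character lines */
    FILE *fp = fopen("transactions.log", "rt, ccs=UTF-16LE");
    if (!fp)
        return 1;

    wchar_t line[512];
    while (fgetws(line, 512, fp) != NULL)
        wprintf(L"%ls", line);   /* parse/convert the line here */

    fclose(fp);
    return 0;
}
```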

The zero bytes can be your friend. If you read in a chunk of the file using fread() and discover embedded zero bytes, it is likely that you have UNICODE. However, the reverse is not true: the absence of zero bytes does not prove that you have ASCII. English letters in UNICODE will have zero bytes, but many other languages (e.g. Chinese) will not.
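That check is only a few lines of C; a sketch, with the helper name being mine:

```c
#include <stdio.h>

/* Heuristic sketch: sample a chunk from the current position; any zero
   byte strongly suggests UTF-16/UTF-32, since ASCII text never contains
   one. Finding no zeros proves nothing, as noted above. */
static int has_zero_bytes(FILE *fp) {
    unsigned char buf[512];
    long pos = ftell(fp);
    size_t n = fread(buf, 1, sizeof buf, fp);
    fseek(fp, pos, SEEK_SET);      /* put the stream back for the caller */

    for (size_t i = 0; i < n; i++)
        if (buf[i] == 0)
            return 1;              /* likely UNICODE (UTF-16/UTF-32) */
    return 0;                      /* inconclusive */
}
```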

If you know what language the text is in, you can test for characters that are not valid in that language, but that is a bit hit and miss.

In the above, I am using "UNICODE" in the Windows sense, to refer to UTF-16 with Intel (little-endian) byte ordering. In the real world, however, you could get UTF-8 or UTF-32, and you might get non-Intel byte ordering. (Theoretically you could also get UTF-7, but that is pretty rare.)

If you have control over the input files, you can insist that they have BOMs, which makes this easy.

Failing that, if you know the language of the files you can try to guess the encoding, but that is less than 100% reliable. Otherwise, you might need to ask the operator (if there is one) to specify the encoding.

