
Reading Unicode characters from a file in C

I need to read Unicode characters from a file. The only thing I need to do with them is extract their Unicode code point numbers. I am running Windows XP and using Code::Blocks with MinGW.

I am doing something like this:

#define UNICODE
#ifdef UNICODE
#define _UNICODE
#else
#define _MBCS
#endif

    #include <stdio.h>
    #include <stdlib.h>
    #include <wchar.h>
    int main()
    {
        wchar_t *filename=L"testunicode.txt";
        FILE *infile;
        infile=_wfopen(filename,L"r");
        wchar_t result=fgetwc(infile);
        wprintf(L"%d",result);//To verify the unicode of character stored in file,print it   
        return 0;
    }

I always get 255 as the result.

testunicode.txt is saved with Encoding = Unicode (created in Notepad).

The final task is to read from a file that can contain characters from any language, but wchar_t is only 2 bytes here. Will it be able to hold the Unicode value of every possible character?

Need your help...



Thanks, everyone, for your replies.

I have now changed the code:

#define UNICODE
#ifdef UNICODE
#define _UNICODE
#else
#define _MBCS
#endif

#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>
int main()
{
    wchar_t *filename=L"testunicode.txt";
    FILE *infile;
    infile=_wfopen(filename,L"r");
    wchar_t  b[2];
    fread(b,2,2,infile);//Read a character from the file
    wprintf(L"%d",b[1]);
    return 0;
}

It prints the correct UTF-16 code. The project where this will be used needs to read characters from many of the world's languages. Will UTF-16 suffice, or should we change the encoding of the stored files to UTF-32? Also, wchar_t is 2 bytes here, and UTF-32 needs a 4-byte data type. How can that be accomplished?

Thanks again for your reply.

Well, the code in your question only reads the first character of your file, so you will have to implement some kind of loop in order to process the whole contents of that file.

Now, fgetwc() is returning 255 (0xFF) for three reasons:

  • You're not taking the byte-order mark (BOM) of the file into account, so you end up reading it instead of the actual file contents.

  • You're not specifying a translation mode flag in the mode argument to _wfopen(), so it defaults to text mode, and fgetwc() accordingly tries to read a multibyte character instead of a wide character.

  • 0xFF (the first byte of a little-endian UTF-16 BOM) is probably not a lead byte in your program's current code page, so fgetwc() returns it without further processing.
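One way to sidestep all three issues is to open the file in binary mode and decode the bytes by hand. The following is a minimal sketch, not a definitive implementation. It assumes the file is little-endian UTF-16, which is what Notepad's "Unicode" encoding produces. The function names utf16le_to_code_points and print_code_points are hypothetical, and the file helper only looks at the first 4096 bytes. Code points are stored in uint32_t rather than wchar_t, so characters outside the Basic Multilingual Plane (encoded as surrogate pairs in UTF-16) fit even where wchar_t is 2 bytes.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Decode little-endian UTF-16 bytes into Unicode code points.
   A leading BOM (FF FE) is skipped. Returns the number of code
   points written to out (at most max_out). */
size_t utf16le_to_code_points(const unsigned char *bytes, size_t len,
                              uint32_t *out, size_t max_out)
{
    size_t i = 0, n = 0;

    /* Skip the little-endian byte-order mark if present. */
    if (len >= 2 && bytes[0] == 0xFF && bytes[1] == 0xFE)
        i = 2;

    while (i + 1 < len && n < max_out) {
        /* Assemble one 16-bit code unit, low byte first. */
        uint32_t unit = (uint32_t)bytes[i] | ((uint32_t)bytes[i + 1] << 8);
        i += 2;

        /* A high surrogate followed by a low surrogate encodes one
           code point above U+FFFF. */
        if (unit >= 0xD800 && unit <= 0xDBFF && i + 1 < len) {
            uint32_t low = (uint32_t)bytes[i] | ((uint32_t)bytes[i + 1] << 8);
            if (low >= 0xDC00 && low <= 0xDFFF) {
                unit = 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00);
                i += 2;
            }
        }
        out[n++] = unit;
    }
    return n;
}

/* Hypothetical helper: open the file in binary mode so the C runtime
   performs no text translation, then print each code point.
   Only the first 4096 bytes are read in this sketch. */
int print_code_points(const char *path)
{
    unsigned char buf[4096];
    uint32_t cps[2048];
    size_t len, n, k;
    FILE *f = fopen(path, "rb");

    if (f == NULL)
        return -1;
    len = fread(buf, 1, sizeof buf, f);
    fclose(f);

    n = utf16le_to_code_points(buf, len, cps, 2048);
    for (k = 0; k < n; k++)
        printf("U+%04lX\n", (unsigned long)cps[k]);
    return 0;
}
```

Calling print_code_points("testunicode.txt") from main() would print one code point per line. Since UTF-16 with surrogate pairs can represent every Unicode code point, the stored files should not need to be converted to UTF-32 for this purpose.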

