How to read a Unicode (UTF-8) / binary file line by line

Question

Hi programmers,

I want to read, line by line, a Unicode (UTF-8) text file created by Notepad. I don't want to display the Unicode strings on the screen; I just want to read and compare them.

This code reads an ANSI file line by line and compares the strings.

What I want

Read test_ansi.txt line by line

if the line = "b" print "YES!"

else print "NO!"

read_ansi_line_by_line.c

#include <stdio.h>
#include <string.h> /* strcmp */

int main(void)
{
    const char *inname = "test_ansi.txt";
    FILE *infile;
    char line_buffer[BUFSIZ]; /* BUFSIZ is defined if you include stdio.h */
    int line_number; /* int, so it matches %d and does not overflow at 127 */

    infile = fopen(inname, "r");
    if (!infile) {
        printf("\nfile '%s' not found\n", inname);
        return 1;
    }
    printf("\n%s\n\n", inname);

    line_number = 0;
    while (fgets(line_buffer, sizeof(line_buffer), infile)) {
        ++line_number;
        /* note that the newline is in the buffer */
        if (strcmp("b\n", line_buffer) == 0) {
            printf("%d: YES!\n", line_number);
        } else {
            printf("%d: NO!\n", line_number);
        }
    }
    printf("\n\nTotal: %d\n", line_number);
    fclose(infile);
    return 0;
}

test_ansi.txt

a
b
c

Compiling

gcc -o read_ansi_line_by_line read_ansi_line_by_line.c

Output

test_ansi.txt

1: NO!
2: YES!
3: NO!


Total: 3

Now I need to read a Unicode (UTF-8) file created by Notepad. After more than 6 months I have not found any good code or library in C that can read a UTF-8 encoded file. I don't know exactly why, but I think standard C doesn't support Unicode.

Reading a Unicode binary file works, but the problem is that the binary file must already have been created in binary mode. That means that to read a Unicode (UTF-8) file created by Notepad, we first need to translate it from a UTF-8 file to a binary file.

This code writes a Unicode string to a binary file. Note that the C file is encoded in UTF-8 and compiled with GCC.

What I want

Write the Unicode char "ب" to test_bin.dat

create_bin.c

#define UNICODE
#ifdef UNICODE
#define _UNICODE
#else
#define _MBCS
#endif

#include <stdio.h>
#include <wchar.h>

int main(void)
{
    /* Data to be stored in the file */
    wchar_t line_buffer[BUFSIZ] = L"ب";
    /* Open the file for writing in binary mode */
    FILE *outfile = fopen("test_bin.dat", "wb");
    if (!outfile) {
        printf("cannot create test_bin.dat\n");
        return 1;
    }
    /* Write the character and its terminating L'\0' as raw wchar_t units */
    fwrite(line_buffer, sizeof(wchar_t), wcslen(line_buffer) + 1, outfile);
    /* Close the file */
    fclose(outfile);

    return 0;
}

Compiling

gcc -o create_bin create_bin.c

Output

create test_bin.dat

Now I want to read the binary file line by line and compare.

What I want

Read test_bin.dat line by line

if the line = "ب" print "YES!"

else print "NO!"

read_bin_line_by_line.c

#define UNICODE
#ifdef UNICODE
#define _UNICODE
#else
#define _MBCS
#endif

#include <stdio.h>
#include <wchar.h>

int main(void)
{
    const wchar_t *inname = L"test_bin.dat";
    FILE *infile;
    /* zero-filled so the data read below is always L'\0'-terminated */
    wchar_t line_buffer[BUFSIZ] = {0};

    infile = _wfopen(inname, L"rb"); /* _wfopen is specific to the Microsoft C runtime */
    if (!infile) {
        wprintf(L"\nfile '%ls' not found\n", inname);
        return 1;
    }
    wprintf(L"\n%ls\n\n", inname);

    /* Read raw wchar_t data, leaving room for the terminator */
    while (fread(line_buffer, 1, sizeof(line_buffer) - sizeof(wchar_t), infile)) {
        /* wcscmp stops at the first L'\0' in the buffer */
        if (wcscmp(L"ب", line_buffer) == 0) {
            wprintf(L"YES!\n");
        } else {
            wprintf(L"NO!\n");
        }
    }
    /* Close the file */
    fclose(infile);
    return 0;
}

Output

test_bin.dat

YES!

THE PROBLEM

This method is VERY LONG and NOT POWERFUL (I'm a beginner in software engineering).

Does anyone know how to read a Unicode file? (I know it's not easy!)
Does anyone know how to convert a Unicode file to a binary file? (a simple method)
Does anyone know how to read a Unicode file in binary mode? (I'm not sure)

Thank you.

Answer 1

A nice property of UTF-8 is that you do not need to decode it in order to compare it. The order returned from strcmp will be the same whether you decode it first or not. So just read it as raw bytes and run strcmp.
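A minimal sketch of that idea, assuming the file really is UTF-8 and each line fits in the buffer passed to fgets. The helper names (strip_line_end, skip_utf8_bom, utf8_line_equals) are made up for illustration. The target "ب" (U+0628) is written as its UTF-8 bytes "\xD8\xA8" so the comparison does not depend on how the C source file itself is encoded.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: remove trailing CR/LF characters in place. */
static void strip_line_end(char *line)
{
    size_t len;
    assert(line != NULL);
    len = strlen(line);
    while (len > 0 && (line[len - 1] == '\n' || line[len - 1] == '\r'))
        line[--len] = '\0';
}

/* Hypothetical helper: skip a UTF-8 BOM (EF BB BF) if the line starts with one. */
static const char *skip_utf8_bom(const char *line)
{
    assert(line != NULL);
    if (strncmp(line, "\xEF\xBB\xBF", 3) == 0)
        return line + 3;
    return line;
}

/* Returns 1 when `line` (as filled in by fgets) equals the UTF-8 string
   `wanted`, comparing raw bytes only; no decoding takes place. */
static int utf8_line_equals(char *line, const char *wanted)
{
    strip_line_end(line);
    return strcmp(skip_utf8_bom(line), wanted) == 0;
}
```

In the question's read loop this would replace the strcmp("b\n", line_buffer) test with something like utf8_line_equals(line_buffer, "\xD8\xA8"). Stripping the line ending first matters because Notepad writes CRLF, and the BOM check only has an effect on the first line.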

Answer 2

I found a solution to my problem, and I would like to share it with anyone interested in reading a UTF-8 file in C99.

#include <stdio.h>
#include <string.h>

void ReadUTF8(FILE* fp)
{
    unsigned char iobuf[255] = {0};
    while( fgets((char*)iobuf, sizeof(iobuf), fp) )
    {
            size_t len = strlen((char *)iobuf);
            /* strip the trailing newline, but leave a bare "\n" line untouched */
            if(len > 1 &&  iobuf[len-1] == '\n')
                iobuf[len-1] = 0;
            len = strlen((char *)iobuf);
            printf("(%lu) \"%s\"  ", (unsigned long)len, iobuf);
            /* "Yes" only for an empty line; put your own comparison here */
            if( iobuf[0] == '\n' )
                printf("Yes\n");
            else
                printf("No\n");
    }
}

void ReadUTF16BE(FILE* fp)
{
    (void)fp; /* not implemented */
}

void ReadUTF16LE(FILE* fp)
{
    (void)fp; /* not implemented */
}

int main(void)
{
    FILE* fp = fopen("test_utf8.txt", "r");
    if( fp != NULL)
    {
        // see http://en.wikipedia.org/wiki/Byte-order_mark for an explanation of the BOM
        // encoding
        unsigned char b[3] = {0};
        fread(b,1,2, fp);
        if( b[0] == 0xEF && b[1] == 0xBB)
        {
            fread(b,1,1,fp); // 0xBF
            ReadUTF8(fp);
        }
        else if( b[0] == 0xFE && b[1] == 0xFF)
        {
            ReadUTF16BE(fp);
        }
        else if( b[0] == 0xFF && b[1] == 0xFE)
        {
            ReadUTF16LE(fp);
        }
        else
        {
            // we don't know what kind of file it is, so assume it's standard
            // ascii with no BOM encoding
            rewind(fp);
            ReadUTF8(fp);
        }
        fclose(fp);
    }

    return 0;
}

Answer 3

fgets() can decode UTF-8 encoded files if you use Visual Studio 2005 and up (the ccs flag is a Microsoft C runtime extension). Change your code like this:

infile = fopen(inname, "r, ccs=UTF-8");

Answer 4

In this article an encoding and decoding routine is written, and it explains how Unicode is encoded:

http://www.codeguru.com/cpp/misc/misc/multi-lingualsupport/article.php/c10451/

It can easily be adapted to C. Simply encode your ANSI string or decode the UTF-8 string and do a byte compare.

EDIT: After the OP said that it is too hard to rewrite the function from C++, here is a template.

What is needed:
+ Free the allocated memory (or wait till the process ends or ignore it)
+ Add the 4 byte functions
+ Tell me that short and int are not guaranteed to be 2 and 4 bytes long (I know, but C is really stupid!) and finally
+ Find some other errors

#include <stdlib.h>
#include <string.h>
#include <wchar.h>

#define         MASKBITS                0x3F
#define         MASKBYTE                0x80
#define         MASK2BYTES              0xC0
#define         MASK3BYTES              0xE0
#define         MASK4BYTES              0xF0
#define         MASK5BYTES              0xF8
#define         MASK6BYTES              0xFC

char* UTF8Encode2BytesUnicode(unsigned short* input)
{
   int size = 0,
       cindex = 0;
   while (input[size] != 0)
     size++;
   // Reserve enough space: at most 3 bytes per 16-bit unit, plus the terminator
   char* result = (char*) malloc(size * 3 + 1);
   if (result == NULL)
     return NULL;
   for (int i=0; i<size; i++)
   {
      // 0xxxxxxx
      if(input[i] < 0x80)
      {
         result[cindex++] = ((char) input[i]);
      }
      // 110xxxxx 10xxxxxx
      else if(input[i] < 0x800)
      {
         result[cindex++] = ((char)(MASK2BYTES | (input[i] >> 6)));
         result[cindex++] = ((char)(MASKBYTE | (input[i] & MASKBITS)));
      }
      // 1110xxxx 10xxxxxx 10xxxxxx
      else
      {
         result[cindex++] = ((char)(MASK3BYTES | (input[i] >> 12)));
         result[cindex++] = ((char)(MASKBYTE | ((input[i] >> 6) & MASKBITS)));
         result[cindex++] = ((char)(MASKBYTE | (input[i] & MASKBITS)));
      }
   }
   result[cindex] = '\0';
   return result;   // the caller must free() this
}

wchar_t* UTF8Decode2BytesUnicode(char* input)
{
  // Work on unsigned bytes so that values >= 0x80 compare and shift correctly
  const unsigned char* in = (const unsigned char*) input;
  int size = (int) strlen(input);
  wchar_t* result = (wchar_t*) malloc((size + 1) * sizeof(wchar_t));
  int rindex = 0,
      windex = 0;
  if (result == NULL)
    return NULL;
  while (rindex < size)
  {
      wchar_t ch;

      // 1110xxxx 10xxxxxx 10xxxxxx
      if((in[rindex] & MASK4BYTES) == MASK3BYTES && rindex + 2 < size)
      {
         ch = ((in[rindex] & 0x0F) << 12) | (
               (in[rindex+1] & MASKBITS) << 6)
              | (in[rindex+2] & MASKBITS);
         rindex += 3;
      }
      // 110xxxxx 10xxxxxx
      else if((in[rindex] & MASK3BYTES) == MASK2BYTES && rindex + 1 < size)
      {
         ch = ((in[rindex] & 0x1F) << 6) | (in[rindex+1] & MASKBITS);
         rindex += 2;
      }
      // 0xxxxxxx
      else if(in[rindex] < MASKBYTE)
      {
         ch = in[rindex];
         rindex += 1;
      }
      // anything else (stray continuation byte, truncated or 4-byte sequence): skip it
      else
      {
         rindex += 1;
         continue;
      }

      result[windex++] = ch;
   }
   result[windex] = 0;
   return result;   // the caller must free() this
}

char* getUnicodeToUTF8(wchar_t* myString) {
  int size = sizeof(wchar_t);
  if (size == 1)
    return (char*) myString;
  else if (size == 2)
    return UTF8Encode2BytesUnicode((unsigned short*) myString);
  else
    return NULL;   // 4-byte wchar_t: UTF8Encode4BytesUnicode() still has to be written
}
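The list above leaves the "4 byte functions" as an exercise. As a possible starting point, here is a sketch of encoding a single code point over the full Unicode range, including 4-byte sequences. The name utf8_encode_codepoint is made up for this example and is not part of the template or of any library. It assumes uint32_t from <stdint.h> is available (C99).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-8 into out[0..3].
   Returns the number of bytes written (1 to 4), or 0 if cp is not
   encodable (a surrogate, or above U+10FFFF). */
static size_t utf8_encode_codepoint(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {                       /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)      /* surrogates are not characters */
        return 0;
    if (cp < 0x10000) {                    /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For example, U+0628 ("ب") comes out as the two bytes D8 A8, which is exactly what Notepad stores for that character in a UTF-8 file. A string encoder for a 4-byte wchar_t could call this once per element and append the bytes it returns.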

Answer 5

I know I am bad... but you don't even take the BOM into consideration! Most examples here will fail.

EDIT:

Byte Order Marks are a few bytes at the beginning of the file which can be used to identify the encoding of the file. Some editors add them, and many times they just break things in fabulous ways (I remember fighting a PHP headers problem for several minutes because of this issue).

Some RTFM:
http://en.wikipedia.org/wiki/Byte_order_mark
http://blogs.msdn.com/oldnewthing/archive/2004/03/24/95235.aspx
What is XML BOM and how do I detect it?
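To make the BOM idea concrete, here is a small sketch that classifies the first bytes of a file. The enum labels and the function name detect_bom are invented for this example. It only knows the UTF-8 and UTF-16 marks; note that a UTF-32LE file also begins with FF FE, so this sketch would report it as UTF-16LE.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative labels, not taken from any library. */
enum bom_kind { BOM_NONE, BOM_UTF8, BOM_UTF16_BE, BOM_UTF16_LE };

/* Look at the first n bytes of a file and report which BOM, if any,
   they start with. The BOM length in bytes is stored in *bom_len so the
   caller knows how many bytes to skip before the real text. */
static enum bom_kind detect_bom(const unsigned char *p, size_t n, size_t *bom_len)
{
    assert(p != NULL && bom_len != NULL);
    if (n >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF) {
        *bom_len = 3;
        return BOM_UTF8;
    }
    if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF) {
        *bom_len = 2;
        return BOM_UTF16_BE;
    }
    if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE) {
        *bom_len = 2;
        return BOM_UTF16_LE;
    }
    *bom_len = 0;
    return BOM_NONE;
}
```

A caller would fread the first three bytes in binary mode, pass them in, and then fseek to bom_len before reading lines. When no BOM is found the file may still be UTF-8, since the mark is optional.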

Answer 6

Just to settle the BOM argument, here is a file from Notepad:

 [paul@paul-es5 tests]$ od -t x1 /mnt/hgfs/cdrive/test.txt
 0000000 ef bb bf 61 0d 0a 62 0d 0a 63
 0000012

with a BOM at the start.

Personally I don't think there should be a BOM (since it's a byte format), but that's not the point.
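The dump shows two things a line reader has to cope with: the three BOM bytes ef bb bf, and CRLF (0d 0a) line endings. Here is a sketch that walks such a buffer in memory and finds a line by raw byte comparison. find_line is a hypothetical helper written for this example, and it assumes the whole file has already been read into memory in binary mode.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return the 1-based number of the first line in
   data[0..n) that equals `wanted`, or 0 if no line matches.
   A leading UTF-8 BOM and a '\r' before each '\n' are ignored. */
static size_t find_line(const unsigned char *data, size_t n, const char *wanted)
{
    size_t wlen, pos = 0, line_no = 0;
    assert(data != NULL && wanted != NULL);
    wlen = strlen(wanted);

    if (n >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        pos = 3;                          /* skip the BOM */

    while (pos < n) {
        size_t start = pos, end;
        while (pos < n && data[pos] != '\n')
            pos++;                        /* advance to the end of the line */
        end = pos;
        if (end > start && data[end - 1] == '\r')
            end--;                        /* drop the CR of a CRLF pair */
        line_no++;
        if (end - start == wlen && memcmp(data + start, wanted, wlen) == 0)
            return line_no;
        if (pos < n)
            pos++;                        /* step over '\n' */
    }
    return 0;
}
```

Fed the ten bytes from the dump, a search for "b" reports line 2, matching the output of the ANSI program in the question. Without the BOM skip, line 1 would be the four bytes ef bb bf 61 and would not compare equal to "a".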

6 answers

Answer 1
6 2010-01-22 22:21:14

Answer 2
5 accepted 2010-01-25 20:33:46

Answer 3
2 2010-01-21 22:27:22

Answer 4
2 2010-01-21 22:32:02

Answer 5
2 2010-01-21 22:57:18

Answer 6
0 2010-01-23 01:01:00