Error reading a Unicode file in C

Question

I want to read a Unicode file in C (Cygwin/GCC) using the following code:

#include <stdio.h>
#include <stdlib.h>
#include <glib.h>


void split_parse(char* text){
    char** res = g_strsplit(text, "=", 2);
    printf("Key = %s : ", res[0]);
    printf("Value = %s", res[1]);
    printf("\n");
}

int main(int argc, char **argv)
{
    setenv ("CYGWIN", "nodosfilewarning", 1);

    GIOChannel *channel;
    GError *err = NULL;
    int reading = 0;
    const gchar* enc;
    guchar magic[2] = { 0 };
    gsize bytes_read = 0;

    const char* filename = "C:\\CONFIG";


    channel = g_io_channel_new_file (filename, "r", &err);

    if (!channel) {
        g_print("%s", err->message);
        return 1;
    }

    if (g_io_channel_set_encoding(channel, NULL, &err) != G_IO_STATUS_NORMAL) {
        g_print("g_io_channel_set_encoding: %s\n", err->message);
        return 1;
    }

    if (g_io_channel_read_chars(channel, (gchar*) magic, 2, &bytes_read, &err) != G_IO_STATUS_NORMAL) {
        g_print("g_io_channel_read_chars: %s\n", err->message);
        return 1;
    }

    if (magic[0] == 0xFF && magic[1] == 0xFE)
    {
        enc = "UTF-16LE";
    }
    else if (magic[0] == 0xFE && magic[1] == 0xFF)
    {
        enc = "UTF-16BE";
    }
    else
    {
        enc = "UTF-8";
        if (g_io_channel_seek_position(channel, 0, G_SEEK_CUR, &err) == G_IO_STATUS_ERROR)
        {
            g_print("g_io_channel_seek: failed\n");
            return 1;
        }
    }

    if (g_io_channel_set_encoding (channel, enc, &err) != G_IO_STATUS_NORMAL) {
        g_print("%s", err->message);
        return 1;
    }

    reading = 1;
    GIOStatus status;
    char* str = NULL;
    size_t len;

    while(reading){

        status = g_io_channel_read_line(channel, &str, &len, NULL, &err);
        switch(status){
            case G_IO_STATUS_EOF:
                reading = 0;
                break;
            case G_IO_STATUS_NORMAL:
                if(len == 0) continue;
                split_parse(str);
                break;
            case G_IO_STATUS_AGAIN: continue;
            case G_IO_STATUS_ERROR:
            default:
                //throw error;
                reading = 0;
                break;
        }
    }

    g_free(str);
    g_io_channel_unref(channel);

    return(EXIT_SUCCESS);
}

The content of the file (C:\\CONFIG) is as follows:

h-debug="1"
name=ME
ÃÆÿÐ®©=2¾1¼

While reading it, I always get the following error message from g_io_channel_read_line inside the while loop:

0x800474f8 "Invalid byte sequence in conversion input"

What am I doing wrong? How can I read a file like this in C using glib?

EDIT: Hexdump of the file

[image: hexdump of the file]

Answer 1

Your file begins with the 3-byte UTF-8 byte-order mark (BOM), EF BB BF.

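Your code defaults to UTF-8 but does not consume the BOM. The seek in the else branch uses offset 0 with G_SEEK_CUR, so it does not move the position: two of the three BOM bytes have already been read, and the third (BF) is left in the stream, where it is not valid UTF-8 on its own. That is consistent with the "Invalid byte sequence" error.

channel, 0, G_SEEK_CUR, &err

should be

channel, 3, G_SEEK_SET, &err

so that reading resumes immediately after the 3-byte BOM. (With G_SEEK_CUR the offset is relative to the two bytes already consumed, so an offset of 3 would skip past the first two bytes of real content.)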

Further, I would recommend extending your magic-number check to read 4 bytes and positively identify the BOM.

If you do not find a BOM, you could assume encoding NULL, which I think means binary. Or throw an error. Or fix the wayward text file. Or, if you are pedantic, try all known encodings in turn.

UTF-32BE  "\x00\x00\xFE\xFF"
UTF-32LE  "\xFF\xFE\x00\x00"
UTF-8     "\xEF\xBB\xBF"
UTF-16BE  "\xFE\xFF"
UTF-16LE  "\xFF\xFE"
NULL for binary
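
As an illustration of the 4-byte check suggested above, here is a minimal sketch in plain C. The helper name detect_bom and its signature are hypothetical and are not part of glib. The function only inspects a buffer the caller has already read, and it reports both the encoding name and the number of BOM bytes to skip. The UTF-32 entries are tested first because the UTF-32LE BOM begins with the same two bytes as the UTF-16LE BOM.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * detect_bom (hypothetical helper, not a glib function):
 * inspect the first `n` bytes of a file held in `buf`.
 * On a match, returns the encoding name and stores the BOM length
 * in *bom_len. Returns NULL and stores 0 if no BOM is recognized.
 */
static const char *detect_bom(const unsigned char *buf, size_t n, size_t *bom_len)
{
    /* Longest BOMs first: FF FE 00 00 (UTF-32LE) starts with FF FE (UTF-16LE). */
    static const struct {
        const char *encoding;
        unsigned char bytes[4];
        size_t len;
    } table[] = {
        { "UTF-32BE", { 0x00, 0x00, 0xFE, 0xFF }, 4 },
        { "UTF-32LE", { 0xFF, 0xFE, 0x00, 0x00 }, 4 },
        { "UTF-8",    { 0xEF, 0xBB, 0xBF, 0x00 }, 3 },
        { "UTF-16BE", { 0xFE, 0xFF, 0x00, 0x00 }, 2 },
        { "UTF-16LE", { 0xFF, 0xFE, 0x00, 0x00 }, 2 },
    };
    size_t i;

    assert(buf != NULL && bom_len != NULL);

    for (i = 0; i < sizeof table / sizeof table[0]; i++) {
        /* Only compare when enough bytes were actually read. */
        if (n >= table[i].len && memcmp(buf, table[i].bytes, table[i].len) == 0) {
            *bom_len = table[i].len;
            return table[i].encoding;
        }
    }

    *bom_len = 0;
    return NULL;
}
```

In the question's program, this would be called on the bytes returned by g_io_channel_read_chars while the channel encoding is still NULL. The channel would then be repositioned with g_io_channel_seek_position(channel, bom_len, G_SEEK_SET, &err) before calling g_io_channel_set_encoding with the returned name. Whether the UTF-32 names are accepted depends on the iconv implementation behind glib, so treat those as an assumption. Note also that a UTF-16LE file whose first character after the BOM is U+0000 would be reported as UTF-32LE; that ambiguity is inherent in BOM sniffing.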

Answer 1
1 2013-06-30 05:16:08
