Problem reading Japanese characters from a file - C

Question

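I am writing a program that reads a file with nearly 2 million lines. Each line has the format: an integer ID, a tab, then a string containing the artist name.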

6821361 Selinsgrove High School Chorus
10151460    greek-Antique
10236365    jnr walker & the all-stars
6878792 Grieg - Kraggerud, Kjekshus
6880556 Mr. Oiseau
6906305 stars on 54 (maxi single)
10584525    Jonie Mitchel
10299729    エリス レジーナ／アントニオ カルロス ジョビン

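Above is a sample containing a few entries from the file (note that some lines do not follow the exact format). My program works fine until it reaches the last line of the sample, and then it endlessly prints エリス レジーナ／アントニオ カルロス ジョビ\343\203.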

struct artist *read_artists(char *fname)
{
    FILE *file;
    struct artist *temp = (struct artist*)malloc(sizeof(struct artist));
    struct artist *head = (struct artist*)malloc(sizeof(struct artist));
    file = fopen("/Users/Daniel/Library/Developer/Xcode/DerivedData/project_Audioscrobbler_Artists-hgwyqpinuoxayzbmvarcjxryqnrz/Build/Products/Debug/artist_data.txt", "r");
    if(file == 0)
    {
        perror("fopen");
        exit(1);
    }
    int artist_ID;
    char artist_name[650];
    while(!feof(file))
    {
        fscanf(file, "%d\t%65[^\t\n]\n", &artist_ID, artist_name);
        temp = create_play(artist_ID, artist_name, 0, -1);
        head = add_play(head, temp);
        printf("%s\n", artist_name);
    }
    fclose(file);
    //print_plays(head);
    return head;
}

Above is my code for reading from the file. Could you help explain what the problem is?

Answer 1

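As the comments point out, one problem is while(!feof(file)). The linked post explains in detail why this is not a good idea, but in summary, quoting one of the answers there:

while(!feof(file)) ...

... is wrong because it tests for something that is irrelevant and fails to test for something that you need to know. The result is that you are erroneously executing code that assumes that it is accessing data that was read successfully, when in fact this never happened. - Kerrek SB

In your case, this usage does not cause your problem, but, as Kerrek explains can happen, it masks it.

You can replace it with fgets(...):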

char lineBuf[1000];//make length longer or shorter for your purpose
file = fopen("/Users/Daniel/Library/Developer/Xcode/DerivedData/project_Audioscrobbler_Artists-hgwyqpinuoxayzbmvarcjxryqnrz/Build/Products/Debug/artist_data.txt", "r");
if(!file) return -1;
while(fgets (lineBuf, sizeof(lineBuf), file))
{
    //process each line here
    //But processing Japanese characters
    //will require special considerations.
    //Refer to the link below for UNICODE tips
}
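
To make the "process each line here" step concrete, here is a minimal sketch of parsing one line returned by fgets(). The helper name parse_artist_line and its signature are hypothetical, not part of the original code, and it assumes the "ID, tab, name" layout described in the question. It rejects a name that does not fit the buffer instead of cutting it in the middle of a multi-byte character.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: parse one "ID<TAB>name" line as returned by fgets().
 * Returns 1 on success, 0 if the line is malformed or the name does not fit. */
static int parse_artist_line(const char *line, int *id,
                             char *name, size_t name_size)
{
    char *end;
    errno = 0;
    long value = strtol(line, &end, 10);
    if (end == line || errno == ERANGE || value < INT_MIN || value > INT_MAX)
        return 0;                 /* no digits, or out of range for int */
    if (*end != '\t')
        return 0;                 /* the ID must be followed by a tab */
    end++;

    /* The name runs up to the next tab or the line ending. */
    size_t len = strcspn(end, "\t\r\n");
    if (len == 0 || len >= name_size)
        return 0;                 /* empty, or too long for the buffer */

    memcpy(name, end, len);       /* copy whole bytes, never a partial name */
    name[len] = '\0';
    *id = (int)value;
    return 1;
}
```

Inside the fgets() loop you would call parse_artist_line(lineBuf, &artist_ID, artist_name, sizeof artist_name) and skip or report any line for which it returns 0.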

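Unicode in C and C++ ...

In particular, you need to use a variable type that is large enough to hold the different-sized characters you want to process. The link discusses this in great detail.

Here is an excerpt:

"char" no longer means character. I hereby recommend referring to character codes in C programs using a 32-bit unsigned integer type. Many platforms provide a "wchar_t" (wide character) type, but unfortunately it is to be avoided, since some compilers allot it only 16 bits, which is not enough to represent Unicode. Wherever you need to pass around an individual character, change "char" to "unsigned int" or similar. The only remaining use for the "char" type is to mean "byte".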
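
As an illustration of the excerpt's advice, the following is a minimal sketch that turns the bytes of one UTF-8 sequence into a single 32-bit character code. It assumes the file is UTF-8 encoded; utf8_decode is a hypothetical helper, not something from the linked article, and it does not reject overlong encodings or surrogate values.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one code point from the n bytes at s.
 * Returns the number of bytes consumed, or 0 if the lead byte is invalid,
 * a continuation byte is missing, or the sequence is cut short. */
static size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    if (n == 0)
        return 0;

    uint32_t c = s[0];
    size_t len;
    if (c < 0x80)                { len = 1; }              /* ASCII */
    else if ((c & 0xE0) == 0xC0) { len = 2; c &= 0x1F; }
    else if ((c & 0xF0) == 0xE0) { len = 3; c &= 0x0F; }
    else if ((c & 0xF8) == 0xF0) { len = 4; c &= 0x07; }
    else
        return 0;                /* stray continuation or invalid byte */

    if (len > n)
        return 0;                /* sequence truncated, as in the question */

    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;            /* not a continuation byte */
        c = (c << 6) | (s[i] & 0x3Fu);
    }
    *cp = c;
    return len;
}
```

The bytes themselves are still read and stored as char; only the decoded character code needs the wider type.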

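Edit:
In a comment above, you stated that the string it fails on is 66 bytes long. Because you are reading into a 'char' array, the bytes needed to complete the character are truncated one byte before the last necessary byte is included. ASCII characters fit in a single char. Japanese characters do not. If you used an unsigned int array rather than a char array, the last byte would be included.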

Answer 2

OP's code fails because the result of fscanf() is not checked.

fscanf(file, "%d\t%65[^\t\n]\n", &artist_ID, artist_name);

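fscanf() read 65 char of "エリス レジーナ／アントニオ カルロス ジョビン". However, this string, encoded in UTF-8, is 66 bytes long. The final 'ン' is the three codes 227, 131, 179 (octal 343 203 263), and only the first two of them were read. Printing artist_name displays the following.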

エリス レジーナ／アントニオ カルロス ジョビ\343\203
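
The byte arithmetic above can be checked with a small sketch. It assumes well-formed UTF-8 input; utf8_complete_prefix is a hypothetical helper that reports how many bytes can be kept under a byte limit without splitting a multi-byte character.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: length in bytes of the longest prefix of s that is
 * at most limit bytes long and does not end inside a multi-byte sequence.
 * Assumes s is well-formed UTF-8. */
static size_t utf8_complete_prefix(const char *s, size_t limit)
{
    size_t len = strlen(s);
    if (len <= limit)
        return len;              /* everything fits */

    size_t cut = limit;
    /* s[cut] is the first byte that would be dropped. If it is a
     * continuation byte (10xxxxxx), back up to the lead byte so the
     * whole character is dropped instead of part of it. */
    while (cut > 0 && ((unsigned char)s[cut] & 0xC0) == 0x80)
        cut--;
    return cut;
}
```

For the artist name from the question, strlen() gives 66, and utf8_complete_prefix(name, 65) gives 63, which is the name without its final 'ン'.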

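Now the problem starts. The last char, 179, remains in file. The next fscanf() fails because char 179 does not convert to an int ("%d"), so fscanf() returns 0. Since the code does not check the result of fscanf(), it does not realize that artist_ID and artist_name are left over from the previous call, and so it prints the same text.

Since feof() is never true, because char 179 is never consumed, we have an infinite loop.

while(!feof(file)) hid this problem, but did not cause it.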
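
The sequence of events described above can be replayed with a small sketch. It assumes tmpfile() is usable on your system; replay_reads is a hypothetical helper that writes one line to a temporary file and then calls fscanf() twice with the format string from the question.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write one line to a temporary file, then read it
 * twice with the question's format string. Returns 0 on success, or -1 if
 * the temporary file could not be created. */
static int replay_reads(const char *line, int *first, int *second,
                        size_t *name_len, int *at_eof)
{
    FILE *fp = tmpfile();
    if (fp == NULL)
        return -1;
    fputs(line, fp);
    rewind(fp);

    int id = 0;
    char name[650] = "";

    /* First call: stores the ID and at most 65 bytes of the name. */
    *first = fscanf(fp, "%d\t%65[^\t\n]\n", &id, name);
    *name_len = strlen(name);

    /* Second call: starts at whatever byte the first call left unread. */
    *second = fscanf(fp, "%d\t%65[^\t\n]\n", &id, name);

    *at_eof = feof(fp) != 0;     /* 1 only if end-of-file was reached */
    fclose(fp);
    return 0;
}
```

Called with the last sample line (ID, tab, the 66-byte name, newline), the first result should be 2 with a 65-byte name, the second result 0, and at_eof 0, which is exactly the state that keeps while(!feof(file)) spinning.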

The fgets() approach suggested by @ryyker is a good one. Another is:

while (fscanf(file, "%d\t%65[^\t\n]\n", &artist_ID, artist_name) == 2) {
    temp = create_play(artist_ID, artist_name, 0, -1);
    head = add_play(head, temp);
    printf("%s\n", artist_name);
}

IOWs, verify the results of *scanf().
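
Since the question mentions that some lines do not follow the format, here is a sketch of one way to go a step further: check the fscanf() result, tell end-of-file apart from a bad line, and skip the bad line instead of stopping. The function name count_valid_records is hypothetical, and the width 649 assumes the 650-byte artist_name buffer from the question.

```c
#include <stdio.h>

/* Hypothetical helper: count the records that parse as "ID<TAB>name",
 * skipping lines that do not. *skipped receives the number of bad lines. */
static size_t count_valid_records(FILE *fp, size_t *skipped)
{
    int id;
    char name[650];
    size_t good = 0;
    *skipped = 0;

    for (;;) {
        /* Width 649 leaves room for the terminating null byte. */
        int rc = fscanf(fp, "%d\t%649[^\t\n]", &id, name);
        if (rc == EOF)
            break;               /* no more input */
        if (rc == 2)
            good++;              /* both fields were converted */
        else
            (*skipped)++;        /* rc is 0 or 1: the line is malformed */

        /* Discard the rest of the current line, whatever it contains. */
        int ch;
        while ((ch = getc(fp)) != EOF && ch != '\n')
            ;
        if (ch == EOF)
            break;
    }
    return good;
}
```

One caveat: the "\t" in a scanf format matches any run of whitespace, including newlines, so a line with an ID and no name can swallow the start of the next line. Parsing lines obtained with fgets() avoids that.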

Answer 1
3, accepted, 2015-11-25 14:52:25

Answer 2
3, 2015-11-25 15:31:21