Opening a Unicode file in pure C

I am trying to open a .txt file that is entirely in Chinese. Can I use the normal fopen/fclose procedures on it even though the stream would be 100% Unicode, or are there any exclusive tools for handling wide characters? I'd be grateful for precise answers; I am a beginner programmer. I am using Linux with standard gcc.

I will attach my code. It compiles with no errors, but upon execution I get a segmentation fault. I don't know what is wrong with it. The point of this program is to find every occurrence of a character from a given set, copy the string of Chinese characters around it, and write that string to a separate file.

#include<stdio.h>
#include<stdlib.h>
#include<wchar.h>
#include <locale.h>
#define PLIK_IN in /*filenames*/
#define PLIK_OUT out
#define LKON 49 /*specifying the length of a string on the left from a desired sign*/
#define PKON 50 /*...and on the right*/
int wczytaj_pliki(FILE*, FILE*); /*open file*/
void krocz_po_pliku(FILE*, FILE*); /*search through file*/
int slownik(wchar_t); /*compare signs*/
void zapisz_pliki(FILE*, FILE*); /*write to file*/

void main(void)
{
    FILE *bin,*bout;
    setlocale(LC_CTYPE, "");

    wczytaj_pliki(bin, bout);
    krocz_po_pliku(bin, bout);
    zapisz_pliki(bin, bout);
}/*main*/

int slownik(wchar_t znak) /*compare characters*/
{
    wchar_t gznak1 = L'股', gznak2 = L'利', gznak3 = L'红';
    if ( ( znak == gznak1) || (znak == gznak2) || (znak == gznak3) ) return 1;
    return 0;
}/*slownik*/

void krocz_po_pliku(FILE* bin, FILE* bout) /*search through file*/
{
    wchar_t wch;
    wchar_t* kontekst;
    int i = 0, j, step = LKON, counter = 0, token = 0;

    while ( (wch = getwchar() ) != EOF )
    {
        if (!token) /*comparing consecutive signs*/
    {
        if ( slownik(wch) == 1 )
        {
            counter++;
            fprintf(bout,"###Wystapienie %d.\n\n", counter);
            if ( i<step ) step = i;
            fseek(bin,-step,1);
            j=0, token = 1;
        }/*if*/
        else i++;
    }/*if*/
    else /*writing consecutive signs within context*/
    {
        if ( j < LKON + PKON)
        {
            putwc(wch, bout);
            j++;
        }/*if*/
        else
        {
            fprintf(bout,"###\n\n");
            fflush(bout);
            token = 0;
        }/*else*/
    }/*else*/
    }/*while*/
        printf("Znalazlem %d wystapien\n", counter);
}/*krocz_po_pliku*/

int wczytaj_pliki(FILE* bin, FILE* bout)
{
    bin=fopen("PLIK_IN","r");
    bout=fopen("PLIK_OUT","w");
    rewind(bin);
    if(bin==NULL || bout==NULL)
{
    printf("Blad plikow\n");
    exit(0);
}/*if*/
    return 1;
}/*wczytaj pliki*/

void zapisz_pliki(FILE* bin, FILE* bout)
{
fclose(bin);
fclose(bout);
}

Yes, fopen can open a file that contains any data, including Unicode data, as long as you can represent the filename in a char*. (On some platforms, namely Windows, files may have names that cannot be represented in a char*.)

You will want to open the file in binary mode to prevent any newline substitution that may be done (unless the Unicode encoding is UTF-8, in which case it doesn't matter), because the substitution is done in terms of chars. Also, if the code units are more than one byte, you will need to make sure you're reading them with the correct endianness.
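As a rough illustration of the endianness point, here is a minimal sketch that reads one 16-bit code unit byte by byte, so the result does not depend on the byte order of the machine running the program. The helper name read_u16 and its big_endian flag are assumptions made for this example, not part of any standard API, and the sketch does not combine surrogate pairs into full code points.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: reads one 16-bit code unit from a stream that was
   opened in binary mode. big_endian selects the byte order used in the file.
   Returns the code unit (0..0xFFFF), or -1 at end of file or if the
   unit is truncated. */
long read_u16(FILE *f, int big_endian)
{
    assert(f != NULL);

    int first = fgetc(f);
    int second = fgetc(f);
    if (first == EOF || second == EOF)
        return -1;

    /* Assemble the value explicitly instead of fread-ing into a uint16_t,
       which would silently use the host byte order. */
    return big_endian ? ((long)first << 8) | second
                      : ((long)second << 8) | first;
}
```

For a UTF-16LE file the character 股 (U+80A1) is stored as the bytes A1 80, so read_u16(f, 0) would yield 0x80A1 for it.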

Note that wchar_t isn't necessarily Unicode and may not be the right type for whatever Unicode encoding is being used by your files. And if your program supports multiple Unicode encodings, do not use BOMs to guess which encoding a file uses.
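One way to avoid depending on wchar_t and the current locale, assuming the file is known to be UTF-8, is to read bytes with fgetc and decode the code points by hand. The sketch below is only that: the helper name read_utf8_codepoint and its return conventions are invented for this example, and it does not reject overlong encodings or surrogate values as a strict decoder should.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: reads one UTF-8 encoded code point from f.
   Returns the code point, -1 at end of file, or -2 on malformed input.
   Simplification: overlong forms and surrogates are not rejected. */
long read_utf8_codepoint(FILE *f)
{
    assert(f != NULL);

    int c = fgetc(f);
    if (c == EOF)
        return -1;

    long cp;
    int extra; /* number of continuation bytes still expected */
    if (c < 0x80)                { cp = c;        extra = 0; }
    else if ((c & 0xE0) == 0xC0) { cp = c & 0x1F; extra = 1; }
    else if ((c & 0xF0) == 0xE0) { cp = c & 0x0F; extra = 2; }
    else if ((c & 0xF8) == 0xF0) { cp = c & 0x07; extra = 3; }
    else return -2; /* stray continuation byte or invalid lead byte */

    while (extra-- > 0) {
        c = fgetc(f);
        if (c == EOF || (c & 0xC0) != 0x80)
            return -2; /* truncated or malformed sequence */
        cp = (cp << 6) | (c & 0x3F);
    }
    return cp;
}
```

With this approach the comparison in slownik could be made against plain numeric code points (for example 0x80A1 for 股) rather than wide character literals.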

Your problem might be caused by the fact that you

#define PLIK_IN in /*filenames*/

and then

bin=fopen("PLIK_IN","r");

Your program is trying to open a file named PLIK_IN, not a file named in, because a macro is not expanded inside a string literal. If PLIK_IN doesn't exist, fopen returns NULL. Passing a null pointer to rewind is undefined behavior and typically crashes the program.

If you'd like to open in, you should

#define PLIK_IN "in" /*filenames*/
/* ... */
bin=fopen(PLIK_IN,"r");

The same goes for PLIK_OUT.
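A further point about the posted code: wczytaj_pliki receives its FILE* arguments by value, so the assignments inside it never reach bin and bout in main, which stay uninitialized. A minimal sketch of one way to address both issues follows. The function name open_files and its return convention are assumptions for this example, not the asker's code; it checks each fopen result before the stream is used and hands the streams back through pointers.

```c
#include <assert.h>
#include <stdio.h>

#define PLIK_IN  "in"   /* input file name, quoted so it is a string */
#define PLIK_OUT "out"  /* output file name */

/* Hypothetical variant of wczytaj_pliki: takes pointers to the caller's
   FILE* variables so the opened streams are visible after the call.
   Returns 1 on success, 0 on failure; on failure nothing is left open
   and both outputs are set to NULL. */
int open_files(const char *in_name, const char *out_name,
               FILE **bin, FILE **bout)
{
    assert(bin != NULL && bout != NULL);

    *bout = NULL;
    *bin = fopen(in_name, "r");
    if (*bin == NULL)
        return 0;

    *bout = fopen(out_name, "w");
    if (*bout == NULL) {
        fclose(*bin);
        *bin = NULL;
        return 0;
    }
    return 1;
}
```

In main this would be called as open_files(PLIK_IN, PLIK_OUT, &bin, &bout), with the program exiting if it returns 0. A freshly opened stream is already positioned at the start, so the rewind call is not needed.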

Last but not least, remember to code in English. It's the lingua franca in our business, and using it significantly increases the number of people who can help you out :)
