
How to read file with special characters? - C

I have a countries.txt document that lists country names in Spanish. That means it contains characters such as " ´ " and " ñ ", for example.

I have a small function that counts the lines in the document. It was originally written with fgets(), and I changed it to use fgetws(), since I understand that special characters should be stored in wchar_t variables.

int count_lines(FILE *f)
{
    int linesCount = 0;
    wchar_t line[MAX_SIZE];

    while(fgetws(line, sizeof(line), f) != NULL){
        linesCount++;
    }
    rewind(f);

    return linesCount;
}

1) If the function reads a line that contains " ´ ", the program crashes. 2) If the file contains no special characters, valgrind reports many more memory leaks than the single one it reports when there is at least one special character such as "ñ".

This is the main:

int main (void)
{
 setlocale(LC_ALL, "spanish");
 countries = fopen("countries.txt", "r");
 int counCount = count_lines(countries);
 fclose(countries);
}

This is the first part of countries.txt:

Aruba
Angola
Albania
Andorra
Argelia
Armenia
Austria
Alemania
Antártida
Argentina

The program crashes when it reaches "Antártida", which contains the letter " á ".

I attach the error valgrind shows:

1 errors in context 1 of 1:
==16211== Conditional jump or move depends on uninitialised value(s)
==16211==    at 0x4FCB443: __wmemchr_avx2 (memchr-avx2.S:97)
==16211==    by 0x4EBE164: _IO_getwline_info (iogetwline.c:86)
==16211==    by 0x4EBDD2C: fgetws (iofgetws.c:53)
==16211==    by 0x108BC3: count_lines (people_generator.c:10)
==16211==    by 0x108B3C: main (main.c:15)
==16211==  Uninitialised value was created by a heap allocation
==16211==    at 0x4C2FB0F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==16211==    by 0x4EBB858: _IO_wfile_doallocate (wfiledoalloc.c:79)
==16211==    by 0x4ECA378: _IO_doallocbuf (genops.c:365)
==16211==    by 0x4EC172B: _IO_wfile_underflow (wfileops.c:179)
==16211==    by 0x4EBF691: _IO_wdefault_uflow (wgenops.c:204)
==16211==    by 0x4EBE1C0: _IO_getwline_info (iogetwline.c:61)
==16211==    by 0x4EBDD2C: fgetws (iofgetws.c:53)
==16211==    by 0x108BC3: count_lines (people_generator.c:10)
==16211==    by 0x108B3C: main (main.c:15)

The file, as stored on disk, doesn't use "wchars". It is stored in some "encoding", most often UTF-8 or Latin-1.

What is probably happening is that "spanish" gives no information about the character encoding. So although you get no error from your call to `setlocale`, you are likely trying to read a UTF-8 file (which uses a multi-byte encoding) with a charmap (one byte per character) encoding.
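
To make the multi-byte versus single-byte distinction concrete: in UTF-8 the letter "á" is stored as the two bytes 0xC3 0xA1, while in ISO-8859-1 (Latin-1) it is the single byte 0xE1. The sketch below is not part of the original answer. looks_like_utf8 is a hypothetical helper that checks only the lead/continuation byte structure (it does not reject overlong forms or surrogates), which is usually enough to tell a UTF-8 file from a Latin-1 one when the text contains accented letters.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: returns true if buf[0..len) has the byte structure
   of UTF-8. Simplified: it does not reject overlong forms or surrogates. */
bool looks_like_utf8(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);

    size_t i = 0;
    while (i < len) {
        unsigned char b = buf[i];
        size_t extra;                 /* continuation bytes expected */

        if (b < 0x80)                extra = 0;   /* plain ASCII */
        else if ((b & 0xE0) == 0xC0) extra = 1;   /* 110xxxxx */
        else if ((b & 0xF0) == 0xE0) extra = 2;   /* 1110xxxx */
        else if ((b & 0xF8) == 0xF0) extra = 3;   /* 11110xxx */
        else return false;            /* stray continuation or invalid lead */

        if (extra > len - i - 1)
            return false;             /* sequence truncated at end of buffer */

        for (size_t k = 1; k <= extra; k++)
            if ((buf[i + k] & 0xC0) != 0x80)
                return false;         /* not a 10xxxxxx continuation byte */

        i += extra + 1;
    }
    return true;
}
```

Pure ASCII input passes this check too, since ASCII is valid in both encodings, so the result is only informative for data that contains bytes of 0x80 or above.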

If you only have to count lines, just use chars, and your program will work as you expect.
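
A minimal sketch of that approach, assuming the file is UTF-8 or Latin-1. In both encodings the newline is the single byte 0x0A, and that byte never occurs inside a UTF-8 multi-byte sequence, so the lines can be counted byte by byte without any locale setup. The name count_lines_bytes is hypothetical, and counting a final line that lacks a trailing newline is a design choice of this sketch.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical replacement for count_lines that works on plain bytes.
   Counts '\n' bytes, plus one for a last line without a trailing newline. */
long count_lines_bytes(FILE *f)
{
    assert(f != NULL);

    long lines = 0;
    int c;
    int last = '\n';              /* so that an empty file counts as 0 lines */

    while ((c = fgetc(f)) != EOF) {
        if (c == '\n')
            lines++;
        last = c;
    }
    if (last != '\n')
        lines++;                  /* unterminated final line */

    rewind(f);                    /* same as the original: leave f at the start */
    return lines;
}
```

Because no fixed-size line buffer is involved, long lines cannot be split and counted twice, which can happen with an fgets()-based loop.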

So, instead of trying to guess, read this now: https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/

After that you should be able to determine your file's encoding, using other tools at your disposal, and then set the correct encoding in your setlocale call. One of "es_ES.UTF-8" or "es_ES.ISO8859-1" should work.

Then, if you have a "real-world" task that involves dealing with international text files, even one this simple, I strongly suggest that you move away from C and use a higher-level language. You will still have to know the file encoding, but life will be an order of magnitude (at least) easier.
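
If you do want to keep fgetws(), a sketch along the following lines may help. It makes two changes that are additions here and not part of the answer above. First, setlocale() returns NULL when the requested locale is not available, so the result is checked instead of ignored. Second, fgetws() takes the buffer size as a number of wide characters, whereas the question's call passes sizeof(line), which is a byte count. The helper names and the value of MAX_SIZE are assumptions, and whether a given locale name is accepted depends on what is installed on the system.

```c
#include <assert.h>
#include <locale.h>
#include <stdio.h>
#include <wchar.h>

#define MAX_SIZE 256   /* assumed value; the question does not show it */

/* Hypothetical helper: tries each locale name in order and returns the
   first one setlocale() accepts, or NULL if none is available. */
const char *select_locale(const char *const *names, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (setlocale(LC_ALL, names[i]) != NULL)
            return names[i];
    return NULL;
}

/* Counts lines with fgetws(). Returns -1 if the stream reported an error,
   for example bytes that are invalid in the current locale's encoding.
   As with the original, a line longer than MAX_SIZE - 1 wide characters
   is read in pieces and counted more than once. */
int count_lines_wide(FILE *f)
{
    assert(f != NULL);

    wchar_t line[MAX_SIZE];
    int lines = 0;

    /* The size argument is a count of wchar_t elements, not bytes. */
    while (fgetws(line, (int)(sizeof line / sizeof line[0]), f) != NULL)
        lines++;

    int failed = ferror(f);       /* check before rewind() clears the flag */
    rewind(f);
    return failed ? -1 : lines;
}
```

In main, one might pass the two names suggested above, "es_ES.UTF-8" and "es_ES.ISO8859-1", to select_locale and stop with an error message if it returns NULL, choosing the order to match the encoding the file actually uses.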
