
How to find the encoding of a txt file in C++?

I am new to C++. I have to find out the encoding of a file passed in by the user, but I do not know how to check a file's encoding. What I need is to print whether the file is Unicode, ANSI, Unicode big endian, or UTF-8. I have searched a lot but could not find a solution. So far all I have done is open the file:

#include "stdafx.h"
#include <iostream>
#include <stdio.h>
#include <conio.h>
#include <fstream>
using namespace std;



int _tmain(int argc, _TCHAR* argv[])
{
    fstream f;
    f.open("c:\\abc.txt", fstream::in | fstream::out); /* Read-write. The backslash must be escaped. */


    getch();
    return 0;
}

So please, can anyone give me a code solution for this?

What if I am accessing a Notepad file?

Thanks in advance.

You cannot.

The best you can do is guess it, or save the encoding as part of your file structure (if you can).

Here is a way I found to detect whether a Notepad file is Unicode, Unicode big endian, UTF-8, or plain ANSI:

I found that when I save a file in Notepad, it stores a byte order mark (BOM) at the start of the file by default. So I decided to use it, as suggested earlier in this question.

First I read the first bytes of the file. I already knew that:
1. If the file is Unicode big endian, its first two bytes are FE FF (254 255 in decimal). A little-endian Unicode file starts with FF FE (255 254).
2. If the file is UTF-8, its first three bytes are EF BB BF (239 187 191 in decimal).

Here is the code:

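#include <stdio.h>

int main(void)
{
        FILE *fp = fopen("c:\\abc.txt", "rb");
        unsigned char bom[3] = {0, 0, 0};
        size_t n;

        if (fp == NULL)
        {
            printf("Cannot open file\n");
            return 1;
        }

        /* Read up to the first three bytes, where a BOM would be stored. */
        n = fread(bom, 1, 3, fp);
        fclose(fp);

        if (n >= 2 && bom[0] == 0xFE && bom[1] == 0xFF)          /* 254 255 */
        {
            printf("Unicode Big Endian File\n");
        }
        else if (n >= 2 && bom[0] == 0xFF && bom[1] == 0xFE)     /* 255 254 */
        {
            printf("Unicode Little Endian File\n");
        }
        else if (n == 3 && bom[0] == 0xEF && bom[1] == 0xBB && bom[2] == 0xBF) /* 239 187 191 */
        {
            printf("UTF8 file\n");
        }
        else
        {
            /* No BOM: treated as ANSI here, although it could be UTF-8 without a BOM. */
            printf("ANSI File\n");
        }

        getchar();

    return 0;
}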

This worked fine for me. I hope it works for others too.

You cannot know for certain what encoding a text file has. One way to check is to look for the BOM at the beginning of the file, which would tell you whether the text is in Unicode. However, the BOM is not mandatory, so you cannot rely on it to differentiate Unicode from other encodings.

A very common way to present this problem is that there is no such thing as plain text.

I'm Spanish, and here you can easily find text files in 7-bit ASCII, extended ASCII, ISO-8859-1 (aka Latin 1, which includes many common extra characters needed for Western Europe), and also UTF in its various flavours.

Hope this somehow helps.

Files generally indicate their encoding with a file header.
And as others suggested, you can never be sure what encoding a file is really using.

Follow these links to get a general idea:
Using Byte Order Marks
FILE SIGNATURES TABLE

As discussed here, the only thing you can do is guess, testing the candidates in the order most likely to rule out invalid matches.

You should check, in this order:

  • Is there a UTF-16 BOM at the beginning? Then it's probably UTF-16. Use the BOM as indicator whether it's big endian or little endian, then check the rest of the file whether it conforms.
  • Is there a UTF-8 BOM at the beginning? Then it's probably UTF-8. Check the rest of the file.
  • If the above didn't result in a positive match, check if the entire file is valid UTF-8. If it is, it's probably UTF-8.
  • If the above didn't result in a positive match, it's probably ANSI.
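The checks above can be sketched in C++ roughly as follows. This is only a heuristic under stated assumptions: the whole file has already been read into a std::string in binary mode, the names is_valid_utf8, starts_with and guess_encoding are made up for illustration, and after a UTF-16 BOM the rest of the file is not verified.

```cpp
#include <cstddef>
#include <string>

// Returns true if `s` is well-formed UTF-8 (rejects overlong forms,
// surrogates, and code points above U+10FFFF).
bool is_valid_utf8(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b = static_cast<unsigned char>(s[i]);
        std::size_t extra;     // continuation bytes expected after the lead byte
        unsigned long cp;      // code point being decoded
        unsigned long min_cp;  // smallest code point allowed for this length
        if (b < 0x80)                { extra = 0; cp = b;        min_cp = 0; }
        else if ((b & 0xE0) == 0xC0) { extra = 1; cp = b & 0x1F; min_cp = 0x80; }
        else if ((b & 0xF0) == 0xE0) { extra = 2; cp = b & 0x0F; min_cp = 0x800; }
        else if ((b & 0xF8) == 0xF0) { extra = 3; cp = b & 0x07; min_cp = 0x10000; }
        else return false;     // stray continuation byte or invalid lead byte
        if (n - i - 1 < extra) return false;  // sequence is cut off
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min_cp) return false;                   // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
        if (cp > 0x10FFFF) return false;                 // outside Unicode
        i += extra + 1;
    }
    return true;
}

bool starts_with(const std::string& s, const std::string& prefix) {
    return s.compare(0, prefix.size(), prefix) == 0;
}

// Applies the checks in the order given above and returns a best guess.
std::string guess_encoding(const std::string& bytes) {
    if (starts_with(bytes, "\xFE\xFF")) return "UTF-16 big endian";
    if (starts_with(bytes, "\xFF\xFE")) return "UTF-16 little endian";
    if (is_valid_utf8(bytes)) {
        // The BOM itself is valid UTF-8, so one validity check covers both cases.
        return starts_with(bytes, "\xEF\xBB\xBF") ? "UTF-8 (with BOM)"
                                                  : "UTF-8 (no BOM)";
    }
    return "ANSI";
}
```

Pure 7-bit ASCII text is also valid UTF-8, so this sketch reports it as UTF-8 without a BOM. "ANSI" here only means "some 8-bit encoding"; the sketch cannot tell which code page it is.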

Open your file with Notepad++ and go to Encoding in the top menu to see the file's encoding type. See here.


 