
C++: Reading alt-key symbols from file

I'm trying to read Alt-key symbols from one UTF-8 encoded file and write them to another.

The input file looks like this:

ỊịỌọỤụṄṅ

The output file looks like this:

239 187 191 225 187 138 225 187 139 225 187 140 225 187 141 225 187 164 225 187 165 225 185 132 225 185 133 (with '\n' after each three-digit value, instead of ' ')

Code:

#include <iostream>
#include <fstream>
#include <string>
#include <sstream>
#include <Windows.h>


///convert as ANSI - display as Unicode
std::wstring test1(const char* filenamein)
{
    std::wifstream fs(filenamein);
    if(!fs.good())
    {
        std::cout << "cannot open input file [" << filenamein << "]" << std::endl;

        // return an empty string; constructing a std::wstring from NULL is undefined behavior
        return std::wstring();
    }

    wchar_t c; 
    std::wstring s;

    while(fs.get(c)) 
    { 
        s.push_back(c); 
        std::cout << '.' << std::flush; 
    }

    return s;

}

int printToFile(const char* filenameout, std::wstring line)
{
    std::wofstream fs;

    fs.open(filenameout);

    if(!fs.is_open())
        return -1;

    for(unsigned i = 0; i < line.length(); i++)
    {
        if(line[i] <= 126)  //if its standard letter just print to file
            fs << (char)line[i];
        else  //otherwise do this.
        {
            std::wstring write = L"";

            std::wostringstream ss;
            ss << (int)line[i];

            write = ss.str();

            fs << write;
            fs << std::endl;
        }
    }

    fs << std::endl;


    //2nd test, also fails
    char const *special_character[] = { "\u2780", "\u2781", "\u2782",
  "\u2783", "\u2784", "\u2785", "\u2786", "\u2787", "\u2788", "\u2789" };

    //prints out four '?'
    fs << special_character[0] << std::endl;
    fs << special_character[1] << std::endl;
    fs << special_character[2] << std::endl;
    fs << special_character[3] << std::endl;

    fs.close();

    return 1;
}

int main(int argc, char* argv[])
{
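    if(argc < 3)  //both file names are required
    {
        std::cout << "usage: " << argv[0] << " <input file> <output file>" << std::endl;
        return 1;
    }

    std::wstring line = test1(argv[1]);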

    if(printToFile(argv[2], line) == 1)
        std::cout << "Writing success!" << std::endl;
    else std::cout << "Writing failed!" << std::endl;



    return 0;
}

What I was expecting was something similar to the values in this table:

http://tools.oratory.com/altcodes.html

Ok, per your code and comments, I understand the following:

  • you have an input file that contains a UTF-8 encoded string
  • you are reading it on Windows into wide characters, but without imbuing any locale into the stream

So here is what actually happens:

Your code correctly reads the file one byte at a time, as an ANSI file (as if it were Windows-1252 encoded). Your program then displays the numeric value of each byte. I can confirm that the list of bytes in your post is the UTF-8 encoding of the string ỊịỌọỤụṄṅ, except that Notepad++ has added a Byte Order Mark (U+FEFF) at the start, which is not normally used in UTF-8 files. The BOM is the three bytes 239 187 191 (decimal), or 0xEF 0xBB 0xBF (hexadecimal).
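To make those byte values concrete, here is a minimal sketch that decodes a UTF-8 byte sequence into code points by hand. The function name decode_utf8 is my own, chosen for illustration, and the code assumes well-formed input (it does no validation), so treat it as a demonstration rather than a production decoder.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: decode a UTF-8 byte string into Unicode code points.
// Assumes well-formed input; continuation bytes are not validated.
std::vector<std::uint32_t> decode_utf8(const std::string& bytes)
{
    std::vector<std::uint32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::uint32_t cp;
        std::size_t extra;  // number of continuation bytes that follow

        if (lead < 0x80)                { cp = lead;        extra = 0; }  // 0xxxxxxx
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }  // 110xxxxx
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }  // 1110xxxx
        else                            { cp = lead & 0x07; extra = 3; }  // 11110xxx
        ++i;

        // Each continuation byte (10xxxxxx) contributes its low 6 bits.
        for (; extra > 0 && i < bytes.size(); --extra, ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i]) & 0x3F);

        out.push_back(cp);
    }
    return out;
}
```

Feeding it the first six bytes from the question's output, 239 187 191 225 187 138, gives U+FEFF (the BOM) followed by U+1ECA (Ị).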

So what could you do?

One simple solution (as you are using Windows) would be to ask Notepad++ to encode the file as UTF-16LE, which is the native Unicode format on Windows. That way you would actually read the Unicode characters.

The other way would be to instruct your code to process the file as UTF-8. That would be trivial on Linux, but can be tricky on Windows, where UTF-8 is only processed correctly since VC2010. This other post from SO shows how to imbue a UTF-8 locale into a C++ stream.
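As a rough sketch of that second approach, assuming a compiler that ships the <codecvt> header (VC2010 or later, or a reasonably recent GCC or Clang): the stream is imbued with a std::codecvt_utf8 facet before the file is opened, and std::consume_header makes the reader skip the BOM. The helper names read_utf8_file and write_utf8_file are hypothetical, and std::codecvt_utf8 has been deprecated since C++17, so expect a warning on newer compilers.

```cpp
#include <codecvt>
#include <fstream>
#include <locale>
#include <string>

// Hypothetical helper: read a whole UTF-8 file into a wide string.
// Returns an empty string if the file cannot be opened.
std::wstring read_utf8_file(const char* path)
{
    std::wifstream fs;
    // Imbue before opening so the very first bytes go through the UTF-8 facet.
    // consume_header skips a leading BOM (EF BB BF) if one is present.
    fs.imbue(std::locale(fs.getloc(),
        new std::codecvt_utf8<wchar_t, 0x10ffff, std::consume_header>));
    fs.open(path, std::ios::binary);

    std::wstring s;
    if (!fs)
        return s;

    wchar_t c;
    while (fs.get(c))
        s.push_back(c);
    return s;
}

// Hypothetical helper: write a wide string to a file as UTF-8 (no BOM).
bool write_utf8_file(const char* path, const std::wstring& text)
{
    std::wofstream fs;
    fs.imbue(std::locale(fs.getloc(), new std::codecvt_utf8<wchar_t>));
    fs.open(path, std::ios::binary);
    if (!fs)
        return false;

    fs << text;
    fs.close();
    return !fs.fail();
}
```

With this in place, each element of the returned wide string is a code point such as 0x1ECA instead of a single byte. On Windows wchar_t is 16 bits, which is enough for the characters in the question since they all lie in the Basic Multilingual Plane.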

I'm sorry for not giving code, but I only have an old VC2008 that does not support UTF-8 streams, and I hate giving untested code.
