
I get an “Invalid UTF-8” error when checking a string, but it looks correct when I print it with std::cout

I am writing some code that must read UTF-8 encoded text files and send their contents to OpenGL.

I am also using a library that I downloaded from this site: http://utfcpp.sourceforge.net/

When I hard-code the string like this, the right images show up in the OpenGL window:

std::string somestring = "abcçdefgğh";
// Convert string to UTF-32 encoding..
// I also set the locale on program startup.

But when I read the UTF-8 encoded string from a file:

  • The library warns me that the string is not valid UTF-8.
  • I can't send the string read from the file to OpenGL. It crashes.
  • But I can still print the string read from the file with std::cout (it looks right).

I use this code to read from file:

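#include <fstream>
#include <string>

void something() {
    std::ifstream ifs("words.xml");
    std::string readd;
    // Loop on getline itself: testing eof() before reading would process
    // a stale line once more after the last successful read.
    while (std::getline(ifs, readd)) {
        // do something..
    }
}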

Now the question is:

  • If the string read from the file is not valid, why does it look as expected when I print it with std::cout?

  • How can I solve this issue?

Thanks in advance:)

The shell you write output to is probably fairly robust against byte sequences it doesn't understand. Apparently not all of the software involved is. It should, however, be relatively straightforward to verify whether your byte sequence is valid UTF-8, because the encoding itself is simple:

  • each code point starts with a byte that encodes both the total number of bytes in the sequence and the first few bits of the code point:
    • if the high bit is 0, the code point consists of one byte and is represented by the 7 lower bits
    • otherwise the number of leading 1 bits gives the total number of bytes; these are followed by a zero bit (obviously), and the remaining bits become the high bits of the code point
  • since the first byte is already accounted for, the bytes that follow are continuation bytes, which have the high bit set and the next bit clear: their lower 6 bits are part of the representation of the code point
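
The rules above can be sketched as a small function that classifies a single byte by its leading bits. The name `sequence_length` and its return convention are made up for this illustration; it only looks at the bit pattern and says nothing about whether the resulting code point is otherwise acceptable.

```cpp
#include <cstdint>

// Hypothetical helper for illustration: judging only by the leading bits,
// how many bytes does the sequence started by `b` occupy?
// Returns 0 for a continuation byte and -1 for a byte that cannot
// appear in UTF-8 at all.
inline int sequence_length(std::uint8_t b) {
    if ((b & 0x80) == 0x00) return 1;   // 0xxxxxxx: single byte, 7 payload bits
    if ((b & 0xC0) == 0x80) return 0;   // 10xxxxxx: continuation byte, 6 payload bits
    if ((b & 0xE0) == 0xC0) return 2;   // 110xxxxx: start of a 2-byte sequence
    if ((b & 0xF0) == 0xE0) return 3;   // 1110xxxx: start of a 3-byte sequence
    if ((b & 0xF8) == 0xF0) return 4;   // 11110xxx: start of a 4-byte sequence
    return -1;                          // 11111xxx: not used by UTF-8
}
```

For example, 'ç' (U+00E7) is encoded in UTF-8 as the two bytes 0xC3 0xA7: the first is classified as the start of a 2-byte sequence, the second as a continuation byte.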

Based on these rules, there are two things which can go wrong and make the UTF-8 invalid:

  1. a continuation byte is encountered at a point where a start byte is expected
  2. a start byte announces more continuation bytes than actually follow

I don't have code at hand that would show where things go wrong, but it should be fairly straightforward to write such code.
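
A minimal sketch of such a check might look like the following. The function name `first_invalid_utf8` is hypothetical and not part of any library. It tests only the two structural failures listed above, so it does not reject overlong encodings, surrogates, or code points beyond U+10FFFF, which a complete validator would also have to catch.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the index of the first byte sequence that
// breaks the structural rules, or std::string::npos if none does.
std::size_t first_invalid_utf8(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;
        if      ((lead & 0x80) == 0x00) len = 1;
        else if ((lead & 0xE0) == 0xC0) len = 2;
        else if ((lead & 0xF0) == 0xE0) len = 3;
        else if ((lead & 0xF8) == 0xF0) len = 4;
        else return i;  // failure 1: not a start byte where one is expected

        for (std::size_t k = 1; k < len; ++k) {
            // failure 2: the input ends, or the next byte is not 10xxxxxx
            if (i + k >= s.size() ||
                (static_cast<unsigned char>(s[i + k]) & 0xC0) != 0x80) {
                return i;  // report the start byte of the broken sequence
            }
        }
        i += len;
    }
    return std::string::npos;
}
```

Calling this on each line read from the file and printing the returned index, together with the byte values around it in hex, should show what is actually in the file. One possible explanation worth checking is that the file is stored in a single-byte encoding: in Latin-1, for instance, 'ç' is the single byte 0xE7, which a terminal using that code page displays correctly but which reads as the start byte of a three-byte sequence in UTF-8.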
