
Read multi-language file - wchar_t vs char?

It's been a horrible experience for me trying to understand Unicode, locales, wide characters, and the conversions between them.

I need to read a text file which contains Russian, English, Chinese and Ukrainian characters all at once.

My approach is to read the file in byte chunks, then process each chunk on a separate thread for fast reading. (Link)

This is done using std::ifstream.read(myChunkBuffer, chunk_byteSize)

However, I understand that there is no way every character in my multi-lingual file can be represented within the 256 values a char can hold, if I stick to char .


So I converted everything into wchar_t and hoped for the best.

I also know about Sys.setlocale(locale = "Russian") (Link), but doesn't that then interpret every character as Russian? I wouldn't know when to flip between my 4 languages as I parse my bytes.

On Windows, I can create a .txt file in Notepad++, write "Привет! Hello!", save it, and re-open it with the same letters intact. Does it somehow secretly add invisible tokens after each character, to mark when to interpret as Russian and when as English?


My current understanding is: keep everything as wchar_t (double-byte) and interpret any file as UTF-16 (double-byte) - is that correct?

Also, I hope to keep the code cross-platform.

Sorry for the noob question.

Unfortunately, standard C++ does not have any real support for your situation (e.g. Unicode in C++11).

You will need to use a text-handling library that does support it. Something like this one

The most important question is: what encoding is that text file in? It is most likely not a single-byte encoding, but Unicode of some sort, as (AFAIK) there is no other way to have Russian and Chinese in one file. So run file <textfile.txt> (or equivalent), or open the file in a hex editor, to determine the encoding - it could be UTF-8, UTF-16, UTF-32, or something else entirely - and act accordingly.
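If you don't have the file command handy, you can sniff the byte-order mark (BOM) yourself. Here's a minimal sketch; the function name guess_encoding is mine, and note that plain UTF-8 files often carry no BOM at all, in which case this can't tell you anything definitive:

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: guess the encoding from a byte-order mark, if present.
// Order matters: the UTF-32LE BOM (FF FE 00 00) starts with the UTF-16LE BOM
// (FF FE), so the longer patterns must be checked first.
std::string guess_encoding (const char* path)
{
    std::ifstream f (path, std::ios::binary);
    unsigned char b[4] = {0, 0, 0, 0};
    f.read (reinterpret_cast<char*> (b), 4);
    if (b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF)                    return "UTF-8 (BOM)";
    if (b[0] == 0xFF && b[1] == 0xFE && b[2] == 0x00 && b[3] == 0x00)    return "UTF-32LE";
    if (b[0] == 0x00 && b[1] == 0x00 && b[2] == 0xFE && b[3] == 0xFF)    return "UTF-32BE";
    if (b[0] == 0xFF && b[1] == 0xFE)                                    return "UTF-16LE";
    if (b[0] == 0xFE && b[1] == 0xFF)                                    return "UTF-16BE";
    return "unknown (possibly UTF-8 without BOM)";
}
```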

wchar_t is, unfortunately, rather useless for portable coding. Back when Microsoft decided what that datatype should be, all Unicode characters fit into 16 bits, so that is what they went for. When Unicode was extended to 21 bits, Microsoft stuck with the definition they had, and eventually made their API work with UTF-16 encoding (which breaks the "one wide character per code point" nature of wchar_t ). "The Unixes", on the other hand, made wchar_t 32 bits and use UTF-32 encoding, so...

Explaining the different encodings goes beyond the scope of a simple Q&A. There is an article by Joel Spolsky (" The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) ") that does a reasonably good job of explaining Unicode, though. There are other encodings out there, and I put together a table that shows the ISO/IEC 8859 encodings and common Microsoft codepages side by side .

C++11 introduced char16_t (for UTF-16 encoded strings) and char32_t (for UTF-32 encoded strings), but several parts of the standard are not quite capable of handling Unicode correctly (toupper / tolower conversions, comparison that correctly handles normalized / unnormalized strings, ...). If you want the whole smack, the go-to library for handling all things Unicode (including conversion to / from Unicode to / from other encodings) in C/C++ is ICU .

Hokay, let's do this. Let's provide a practical solution to the specific problem of reading text from a UTF-8 encoded file and getting it into a wide string without losing any information.

Once we can do that, we should be OK because the utility functions presented here will handle all UTF-8 to wide-string conversion (and vice-versa) in general and that's the key thing you're missing.

So, first, how would you read in your data? Well, that's easy. Because, at one level, UTF-8 strings are just a sequence of chars , you can, for many purposes, simply treat them that way. So you just need to do what you would do for any text file, e.g.:

std::ifstream f;
f.open ("myfile.txt", std::ifstream::in);
if (!f.fail ())
{
    std::string utf8;
    f >> utf8;   // note: operator>> reads a single whitespace-delimited token;
                 // use std::getline (f, utf8) to read a whole line instead
    // ...
}

So far so good. That all looks easy enough.

But now, to make processing the string we just read in easier (because handling multi-byte strings in code is a total pain), we need to convert it to a so-called wide string before we try to do anything with it. There are actually a few flavours of these (because of the uncertainty surrounding just how 'wide' wchar_t actually is on any particular platform), but for now I'll stick with wchar_t to keep things simple, and doing that conversion is actually easier than you might think.

So, without further ado, here are your conversion functions (which is what you bought your ticket for):

#include <string>
#include <codecvt>
#include <locale>

std::string narrow (const std::wstring& wide_string)
{
    std::wstring_convert <std::codecvt_utf8 <wchar_t>, wchar_t> convert;
    return convert.to_bytes (wide_string);
}

std::wstring widen (const std::string& utf8_string)
{
    std::wstring_convert <std::codecvt_utf8 <wchar_t>, wchar_t> convert;
    return convert.from_bytes (utf8_string);
}

My, that was easy, why did those tickets cost so much in the first place?

I imagine that's all I really need to say. From what you say in your question, I think you already had a fair idea of what you wanted to be able to do - you just didn't know how to achieve it (and perhaps hadn't quite joined up all the dots yet). But just in case there is any lingering confusion: once you do have a wide string, you can freely use all the methods of std::basic_string on it and everything will 'just work'. And if you need to convert back to a UTF-8 string to (say) write it out to a file, well, that's trivial now.

Test program over at the most excellent Wandbox . I'll touch this post up later, there are still a few things to say. Time for breakfast now :) Please ask any questions in the comments.

Notes (added as an edit):

  • codecvt is deprecated in C++17 (not sure why), but if you limit its use to just those two functions then it's not really anything to worry about. One can always rewrite those if and when something better comes along (hint, hint, dear standards persons).
  • codecvt can, I believe, handle other character encodings, but as far as I'm concerned, who cares?
  • if std::wstring (which is based on wchar_t ) doesn't cut it for you on your particular platform, then you can always use std::u16string or std::u32string .

And here's a second answer - about Microsoft's (lack of) standards compliance with regard to wchar_t - because, thanks to the standards committee hedging their bets, the situation with this is more confusing than it needs to be.

Just to be clear, wchar_t on Windows is only 16 bits wide and, as we all know, there are many more Unicode characters than that these days, so, on the face of it, Windows is non-compliant (albeit, as we again all know, they do what they do for a reason).

So, moving on, I am indebted to Bo Persson for digging up this (emphasis mine):

The Standard says in [basic.fundamental]/5 :

Type wchar_t is a distinct type whose values can represent distinct codes for all members of the largest extended character set specified among the supported locales . Type wchar_t shall have the same size, signedness, and alignment requirements as one of the other integral types, called its underlying type. Types char16_t and char32_t denote distinct types with the same size, signedness, and alignment as uint_least16_t and uint_least32_t , respectively, in <cstdint> , called the underlying types.

Hmmm. "Among the supported locales." What's that all about?

Well, I for one don't know, and nor, I suspect, does the person who wrote it. It's just been put in there to let Microsoft off the hook, simple as that. It's just double-speak.

As others have commented here (in effect), the standard is a mess. Someone should put something about this in there that other human beings can understand.

The C++ standard defines wchar_t as a type which will support any code point. On Linux this is true. MSVC violates the standard and defines it as a 16-bit integer, which is too small.

Therefore the only portable way to handle strings is to convert them from native strings to utf-8 on input and from utf-8 to native strings at the point of output.

You will of course need to use some #ifdef magic to select the correct conversion and I/O calls depending on the OS.
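A minimal sketch of that #ifdef selection (the function name to_utf8 is mine; the Windows branch uses the Win32 WideCharToMultiByte call, the other branch falls back to codecvt):

```cpp
#include <string>

#ifdef _WIN32
#include <windows.h>

// On Windows, native wide strings are UTF-16; convert at the boundary.
std::string to_utf8 (const std::wstring& native)
{
    int len = WideCharToMultiByte (CP_UTF8, 0, native.c_str (), -1,
                                   nullptr, 0, nullptr, nullptr);
    std::string out (len - 1, '\0');   // len counts the terminating NUL
    WideCharToMultiByte (CP_UTF8, 0, native.c_str (), -1,
                         &out[0], len, nullptr, nullptr);
    return out;
}
#else
#include <codecvt>
#include <locale>

// On Unix-likes, wchar_t holds UTF-32; codecvt handles the conversion.
std::string to_utf8 (const std::wstring& native)
{
    std::wstring_convert <std::codecvt_utf8 <wchar_t>, wchar_t> convert;
    return convert.to_bytes (native);
}
#endif
```

A matching from_utf8 would mirror this with MultiByteToWideChar / from_bytes, giving you a single pair of boundary functions to call everywhere else in the program.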

Non-adherence to standards is the reason we can't have nice things.
