Preparing for char8_t in C++ 17

I'm using Microsoft Visual C++ 16.1 (2019 Community) and am trying to write code that will be "proper" in C++20, which is expected to have a char8_t type that will be an unsigned char. I define a type like this:

using char8_t = unsigned char;

Code such as the following:

std::string data;
const char8_t* ptr = data.c_str();

does not compile, because the const char pointer returned by c_str() will not convert to an unsigned char pointer without a reinterpret_cast. Is there something I can do to prepare for C++20 without having reinterpret_casts all over the place?

Thanks for the comments. The comments and further research have corrected a major misconception which prompted the original question. I now understand that a C++20 char8_t is not a UTF-8 character and a C++20 u8string is not a UTF-8 string. While they may be used in a "UTF-8 string" implementation, they are not such.

Thus, it appears that the use of reinterpret_casts is unavoidable, but they can be hidden/isolated in a set of inline function overloads (or a set of function templates). A utf8string type (perhaps as a template) would have to be implemented as a distinct object, if such is not already available somewhere.

P1423 (char8_t backward compatibility remediation) documents a number of approaches that can be used to remediate the backward compatibility impact due to the adoption of char8_t via P0482 (char8_t: A type for UTF-8 characters and strings).

Because char8_t is a non-aliasing type, it is undefined behavior to use reinterpret_cast to, for example, access char data through a pointer to char8_t, as in reinterpret_cast<const char8_t*>(data.c_str()). However, because char and unsigned char are allowed to alias any type, it is permissible to use reinterpret_cast in the other direction, e.g., reinterpret_cast<const char*>(u8"text").
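
As a rough illustration of the two directions, the sketch below hides the permitted cast (UTF-8 code units viewed as char) in one inline function and copies, rather than casts, when going from char data to UTF-8 code units. The names utf8_char, as_char and to_utf8_string are hypothetical and are not taken from P1423 or the standard library. The __cpp_char8_t feature-test macro is used on the assumption that the same source should build both as C++17 and as C++20.

```cpp
#include <cassert>
#include <string>
#include <string_view>

#if defined(__cpp_char8_t)
using utf8_char = char8_t;   // C++20: distinct, non-aliasing code unit type
#else
using utf8_char = char;      // C++17: u8"" literals are arrays of plain char
#endif

using utf8_string = std::basic_string<utf8_char>;  // hypothetical alias

// View UTF-8 code units as char. This direction is allowed because
// char may alias any object type.
inline const char* as_char(const utf8_char* s) noexcept {
    assert(s != nullptr);
    return reinterpret_cast<const char*>(s);
}

// Going the other way, copy the code units instead of casting the pointer,
// so no char object is ever read through a char8_t pointer.
inline utf8_string to_utf8_string(std::string_view s) {
    return utf8_string(s.begin(), s.end());
}
```

With these helpers, the line from the question could be written as utf8_string data8 = to_utf8_string(data); followed by const utf8_char* ptr = data8.c_str();, at the cost of one copy.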

None of the remediation approaches documented in P1423 are silver bullets. You'll need to evaluate what works best for your use cases. You might also appreciate the answers in "C++20 with u8, char8_t and std::string".

With regard to char8_t not being a UTF-8 character and u8string not being a UTF-8 string, that is correct in that char8_t is a code unit type (not a code point type) and u8string does not enforce well-formed UTF-8 sequences. However, the intent is very much that these types only be used for UTF-8 data.
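
To make the code unit versus code point distinction concrete, here is a minimal sketch that counts code points by skipping UTF-8 continuation bytes. The function name count_code_points is hypothetical, the input is assumed to be well-formed UTF-8 (nothing is validated), and plain char storage is used so the example also builds as C++17.

```cpp
#include <cstddef>
#include <string_view>

// Count code points in (assumed well-formed) UTF-8 text.
// Continuation bytes have the bit pattern 10xxxxxx and are not counted,
// so every code point contributes exactly one lead byte.
inline std::size_t count_code_points(std::string_view utf8) noexcept {
    std::size_t count = 0;
    for (unsigned char unit : utf8) {
        if ((unit & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the string "h\xC3\xA9llo" (the word "héllo" spelled out as bytes), size() reports 6 code units while count_code_points reports 5 code points.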
