I'm using Microsoft Visual C++ 16.1 (2019 Community) and am trying to write code which will be "proper" in C++ 2020 which is expected to have a char8_t type which will be an unsigned char. I define a type like this:
using char8_t = unsigned char;
Code such as the following:
std::string data;
const char8_t* ptr = data.c_str ();
does not compile, because a const char* will not convert to a const unsigned char* without a reinterpret_cast. Is there something I can do to prepare for 2020 without having reinterpret_casts all over the place?
Thanks for the comments. The comments and further research have corrected a major misconception that prompted the original question. I now understand that a 2020 char8_t is not a UTF-8 character and a 2020 u8string is not a UTF-8 string. While they may be used in a "UTF-8 string" implementation, they are not such.
Thus, it appears that use of reinterpret_cast is unavoidable, but the casts can be hidden/isolated in a set of inline function overloads (or a set of function templates). Implementing a utf8string object (perhaps as a template) as a distinct type is necessary (if one is not already available somewhere).
P1423 (char8_t backward compatibility remediation) documents a number of approaches that can be used to remediate the backward compatibility impact due to the adoption of char8_t via P0482 (char8_t: A type for UTF-8 characters and strings).
Because char8_t is a non-aliasing type, it is undefined behavior to read char data through a char8_t pointer obtained with reinterpret_cast, as in reinterpret_cast<const char8_t*>(data.c_str()). However, because char and unsigned char are allowed to alias any type, it is permissible to use reinterpret_cast in the other direction, e.g., reinterpret_cast<const char*>(u8"text").
None of the remediation approaches documented in P1423 are silver bullets. You'll need to evaluate what works best for your use cases. You might also appreciate the answers in "C++20 with u8, char8_t and std::string".
With regard to char8_t not being a UTF-8 character and u8string not being a UTF-8 string, that is correct in that char8_t is a code unit type (not a code point type) and u8string does not enforce well-formed UTF-8 sequences. However, the intent is very much that these types only be used for UTF-8 data.