Representing any universal character within the range of 0x00 to 0x7F in C++?

Question

I am writing a lexer in MSVC and I need a way to represent an exact character match for all 128 Basic Latin Unicode characters.
However, according to this MSDN article, "With the exception of 0x24 and 0x40, characters in the range of 0 to 0x20 and 0x7f to 0x9f cannot be represented with a universal character name (UCN)."

...Which basically means I am not allowed to declare something like wchar_t c = '\u0000', let alone use a switch statement on this 'disallowed' range of characters. Also, for '\n' and '\r', it is to my understanding that the actual values/lengths vary between compilers/target OSes...
(i.e., Windows uses '\r\n', while Unix simply uses '\n' and older versions of Mac OS use '\r')
...and so I have made a workaround for this using universal characters in order to ensure proper encoding schemes and byte lengths are detected and properly used.

But this C3850 compiler error simply refuses to allow me to do things my way...
So how can this be solved in a manner that ensures proper encoding schemes & character matches given ANY source input?

Answer 1

In C++11 the restrictions on what characters you may represent with universal character names do not apply inside character and string literals.

C++11 2.3/2

Additionally, if the hexadecimal value for a universal-character-name outside the c-char-sequence, s-char-sequence, or r-char-sequence of a character or string literal corresponds to a control character (in either of the ranges 0x00–0x1F or 0x7F–0x9F, both inclusive) or to a character in the basic source character set, the program is ill-formed.

That means that those restrictions on UCNs don't apply inside character and string literals:

wchar_t c = L'\u0000'; // perfectly okay

switch(c) {
    case L'\u0000':
        ;
}

This was different in C++03, and I assume from your question that Microsoft has not yet updated their compiler to allow this. However, I don't think this matters, because using UCNs does not solve the problem you're trying to solve.

and so I have made a workaround for this using universal characters in order to ensure proper encoding schemes and byte lengths are detected and properly used

Using UCNs does not do anything to determine the encoding scheme used. A UCN is a way of including a particular character in your source that does not depend on the source file's encoding, but the compiler is required to treat it exactly the same as if that character had been written literally in the source.

For example, take the code:

#include <iostream>

int main() {
    unsigned char c = 'µ'; // stored value depends on the execution character set
    std::cout << (int)c << '\n';
}

If you save the source as UTF-16 and build this with Microsoft's compiler on a Windows system configured to use code page 1252, then the compiler will convert the UTF-16 representation of 'µ' to the CP1252 representation. If you build this source on a system configured with a different code page, one that does not contain the character, then the compiler will give a warning/error when it fails to convert the character to that code page.

Similarly, if you save the source code as UTF-8 (with the so-called 'BOM', so that the compiler knows the encoding is UTF-8) then it will convert the UTF-8 source representation of the character to the system's code page if possible, whatever that is.

And if you replace 'µ' with a UCN, '\u00B5', the compiler will still do exactly the same thing; it will convert the UCN to the system's code page representation of U+00B5 MICRO SIGN, if possible.

So how can this be solved in a manner that ensures proper encoding schemes & character matches given ANY source input?

I'm not sure what you're asking. I'm guessing you want to ensure that the integral values of char or wchar_t variables/literals are consistent with a certain encoding scheme (probably ASCII since you're only asking about characters in the ASCII range), but what is the 'source input'? The encoding of your lexer's source files? The encoding of the input to your lexer? How do you expect the 'source input' to vary?

Also, for '\n' and '\r', it is to my understanding that the actual values/lengths vary between compilers/target OSes... (i.e., Windows uses '\r\n', while Unix simply uses '\n' and older versions of Mac OS use '\r')

This is a misunderstanding of text mode I/O. When you write the character '\n' to a text mode file, the OS can replace the '\n' character with some platform-specific representation of a new line. However, this does not mean that the actual value of '\n' is any different. The change is made purely within the library for writing files.

For example, you can open a file in text mode, write '\n', then open the file in binary mode and compare the written data to '\n', and the written data can differ from '\n':

#include <fstream>
#include <iostream>

int main() {
    char const * filename = "test.txt";
    {
        std::ofstream fout(filename);
        fout << '\n';
    }
    {
        std::ifstream fin(filename, std::ios::binary);
        char buf[100] = {};
        fin.read(buf, sizeof(buf));
        if (sizeof('\n') == fin.gcount() && buf[0] == '\n') {
            std::cout << "text mode written '\\n' matches value of '\\n'\n";
        } else {
            // This will be executed on Windows
            std::cout << "text mode written '\\n' does not match value of '\\n'\n";
        }
    }
}

This also doesn't depend on using the '\n' syntax; you can rewrite the above using 0xA, the ASCII newline character, and the results will be the same on Windows. (I.e., when you write the byte 0xA to a text mode file, Windows will actually write the two bytes 0xD 0xA.)

Answer 2

I found that omitting the character literal and simply using the hexadecimal value of the character allows everything to compile just fine.

For example, you would change the following line:

wchar_t c = L'\u0000';

...to:

wchar_t c = 0x0000;

Though, I'm still not sure if this actually holds the same independent values provided by a UCN.
