
How to detect non-ASCII characters in C++ on Windows?

I'm simply trying to detect non-ASCII characters in my C++ program on Windows. Using something like isascii() or:

bool is_printable_ascii = (ch & ~0x7f) == 0 &&
                          (isprint(ch) || isspace(ch));

does not work because non-ASCII characters are getting mapped to ASCII characters before or while getchar() is doing its thing. For example, if I have some code like:

#include <iostream>
using namespace std;
int main()
{
    int c;
    c = getchar();
    cout << isascii(c) << endl;
    cout << c << endl;
    printf("0x%x\n", c);
    cout << (char)c;
    return 0;
}

and input an emoji (because I am so happy right now), the output is

1
63
0x3f
?

Furthermore, if I feed the program something outside the extended ASCII range (codepage 437), like 'Ĥ', the output is

1
72
0x48
H

This works with similar inputs such as Ĭ or ō (which map to I and o). So this seems algorithmic and not just mojibake or something. A quick check in Python (via the same terminal) with a program like

i = input()
print(ord(i))

gives me the expected code point instead of the ASCII-mapped one (so it's not the codepage or the terminal?). This makes me believe getchar() or the C++ compilers (tested with the VS compiler and g++) are doing something funky. I have also tried using cin and many other alternatives. Note that I've tried this on Linux and I cannot reproduce the issue, which makes me inclined to believe it is something to do with Windows (10 Pro). Can anyone explain what is going on here?

Try replacing getchar() with getwchar(); I think you're right that it's a Windows-only problem.

I think the problem is that getchar() expects input as a char type, which is 8 bits and only supports ASCII. getwchar() works with the wchar_t type, which allows for other text encodings. The emoji isn't ASCII, and from this page: https://docs.microsoft.com/en-us/windows/win32/learnwin32/working-with-strings, it seems like Windows encodes extended characters like this in UTF-16. I was having trouble finding a lookup table for UTF-16 emoji, but I'm guessing that one of the bytes of its UTF-16 encoding is 0x3f, which is why you're seeing that printed out.

Okay, I have solved this. I was not aware of translation modes.

_setmode(_fileno(stdin), _O_WTEXT);

was the solution. This link essentially explains that there are translation modes, and I think phase 5 (character-set mapping) explains what happened: https://en.cppreference.com/w/cpp/language/translation_phases
