How do I compare single multibyte character constants cross-platform in C?

In my previous post I found a way to do this using C++ strings, but I wonder whether there is a solution using chars in C as well.

My current solution uses str.compare() and size() of a character string, as seen in my previous post.

Now, since I only use one (multibyte) character in the std::string, would it be possible to achieve the same using a char?

For example, something like if (str[i] == '¶')? How do I achieve that using chars?

(edit: I made a typo in the comparison operator, as pointed out in the comments)

How do I compare single multibyte character constants cross-platform in C?

You seem to mean an integer character constant expressed using a single multibyte character. The first thing to recognize, then, is that in C, integer character constants (examples: 'c', '¶') have type int, not char. The primary relevant section of C17 is paragraph 6.4.4.4/10:

An integer character constant has type int. The value of an integer character constant containing a single character that maps to a single-byte execution character is the numerical value of the representation of the mapped character interpreted as an integer. The value of an integer character constant containing more than one character (e.g., 'ab'), or containing a character or escape sequence that does not map to a single-byte execution character, is implementation-defined. If an integer character constant contains a single character or escape sequence, its value is the one that results when an object with type char whose value is that of the single character or escape sequence is converted to type int.

(Emphasis added.)
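
The portable part of that paragraph can be checked directly. The following is a minimal sketch, not part of the standard's text; constant_matches_char is a hypothetical name chosen for illustration, and only a basic (single-byte) character is used so that nothing here is implementation-defined.

```c
#include <assert.h>

/* In C (unlike C++), a character constant such as 'c' has type int,
   so its size is that of int, not 1. */
static_assert(sizeof 'c' == sizeof(int),
              "integer character constants have type int in C");

/* Hypothetical helper: for a single-byte character, the constant's value
   equals that of a char holding the character, converted to int. */
int constant_matches_char(void) {
    char c = 'c';
    return 'c' == (int)c;
}
```

No such guarantee exists for '¶' unless it happens to map to a single-byte execution character.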

Note well that "implementation defined" implies limited portability from the get-go. Even if we rule out implementations defining perverse behavior, we still have alternatives such as

  • the implementation rejects integer character constants containing multibyte source characters; or
  • the implementation rejects integer character constants that do not map to a single-byte execution character; or
  • the implementation maps source multibyte characters via a bytewise identity mapping, regardless of the byte sequence's significance in the execution character set.

That is not an exhaustive list.

You can certainly compare integer character constants with each other, but if they map to multibyte execution characters then you cannot usefully compare them to individual chars.
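
The closest working analog of str[i] == '¶' is therefore a comparison of the bytes that start at position i against the bytes of the character. Here is a minimal sketch under stated assumptions: mb_char_at is a hypothetical helper, the character is passed as a string holding exactly one multibyte character, and both strings use the same stateless encoding (UTF-8, say).

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: true if the multibyte character mb (a NUL-terminated
   string holding exactly one character) starts at str[i].
   The caller must ensure that i <= strlen(str). */
bool mb_char_at(const char *str, size_t i, const char *mb) {
    assert(str != NULL && mb != NULL);
    /* strncmp stops at the NUL in str, so it never reads past the
       terminator even when fewer than strlen(mb) bytes remain. */
    return strncmp(str + i, mb, strlen(mb)) == 0;
}
```

For instance, with UTF-8 data, mb_char_at(str, i, "\xC2\xB6") tests for ¶ at byte offset i; spelling the character as byte escapes keeps the test independent of the source file's encoding.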

Inasmuch as your intended application appears to be to locate individual multibyte characters in a C string, the most natural thing to do appears to be to implement a C analog of your C++ approach, using the standard strstr() function. Example:

    #include <stdio.h>
    #include <string.h>

    int main(void) {
        char str[] = "Some string ¶ some text ¶ to see";
        char char_to_compare[] = "¶";
        size_t char_size = sizeof(char_to_compare) - 1;  // don't count the string terminator

        // Each strstr() call resumes the search just past the previous match.
        for (char *location = strstr(str, char_to_compare);
                location;
                location = strstr(location + char_size, char_to_compare)) {
            puts("Found!");
        }
        return 0;
    }

That will do the right thing in many cases, but it still might be wrong for some characters in some execution character encodings, such as those encodings featuring multiple shift states.

If you want robust handling for characters outside the basic execution character set, then you would be well advised to take control of the in-memory encoding, and to perform appropriate conversions to, operations on, and conversions from that encoding. This is largely what ICU does, for example.
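
To illustrate what taking control of the encoding can look like, here is a minimal sketch that assumes the in-memory encoding is UTF-8 and compares decoded code points rather than bytes. The names utf8_decode and count_code_point are hypothetical, and the decoder is simplified (it does not reject overlong forms or surrogates); it is not how ICU is implemented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence starting at s into *cp.
   Returns the number of bytes consumed, or 0 at the end of the string
   or on malformed input. */
size_t utf8_decode(const char *s, uint32_t *cp) {
    const unsigned char *p = (const unsigned char *)s;
    size_t len;

    if (p[0] == 0) return 0;
    if (p[0] < 0x80) { *cp = p[0]; return 1; }
    else if ((p[0] & 0xE0) == 0xC0) { *cp = p[0] & 0x1F; len = 2; }
    else if ((p[0] & 0xF0) == 0xE0) { *cp = p[0] & 0x0F; len = 3; }
    else if ((p[0] & 0xF8) == 0xF0) { *cp = p[0] & 0x07; len = 4; }
    else return 0;

    for (size_t i = 1; i < len; i++) {
        /* A continuation byte must look like 10xxxxxx; this check also
           stops at a premature NUL terminator. */
        if ((p[i] & 0xC0) != 0x80) return 0;
        *cp = (*cp << 6) | (p[i] & 0x3F);
    }
    return len;
}

/* Count how often the code point target occurs in the UTF-8 string s.
   Counting stops at the first malformed sequence. */
size_t count_code_point(const char *s, uint32_t target) {
    assert(s != NULL);
    size_t count = 0, n;
    uint32_t cp;
    while ((n = utf8_decode(s, &cp)) != 0) {
        if (cp == target) count++;
        s += n;
    }
    return count;
}
```

With this, the comparison is against the code point U+00B6 (0xB6) rather than against an implementation-defined character constant.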

I believe you meant something like this:

    char a = '¶';
    char b = '¶';

    if (a == b) /* do something */;

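The above may or may not work. If the value of '¶' does not fit in the range of char, the conversion changes it (the result is implementation-defined when char is signed), so a and b store a value different from that of '¶'. Both are converted the same way, however, so a and b still compare equal; the catch is that a different character may convert to the same char value.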

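Remember, the char type is simply a single-byte integer (8 bits on most platforms), so to hold the value of a multibyte character constant without narrowing you have to use a wider integer type (short, int, long, ...). int is the safe choice, since that is the constant's own type; the value itself remains implementation-defined.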

    int a = '¶';  /* int is the constant's own type, so no narrowing occurs */
    int b = '¶';

    if (a == b) /* do something */;
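
Standard C also provides a dedicated wide type for this, wchar_t, together with wide constants such as L'¶'. The following is a minimal sketch under stated assumptions: count_wide_char is a hypothetical name, the text is already held as a wide string, and the character fits in a single wchar_t (true for ¶ on common platforms, but not for every character where wchar_t is 16 bits wide). The encoding of wchar_t is itself implementation-defined.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Count occurrences of the wide character target in the wide string ws.
   Each element is compared directly, as with str[i] == 'c' for plain chars. */
size_t count_wide_char(const wchar_t *ws, wchar_t target) {
    assert(ws != NULL);
    size_t count = 0;
    for (; *ws != L'\0'; ws++) {
        if (*ws == target) count++;
    }
    return count;
}
```

A call such as count_wide_char(L"Some string \u00B6 some text", L'\u00B6') spells ¶ as a universal character name, which avoids depending on the source file's encoding.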

From personal experience, I've also noticed that your environment may sometimes use a different character encoding than the one you need. For example, trying to print the 'á' character may actually produce something else.

    unsigned char x = 'á';
    putchar(x);   // actually prints 'ß' in my console
    putchar(160); // prints 'á'

This happens because the console uses an extended-ASCII code page (such as code page 437, in which 160 is 'á' and 225 is 'ß'), while my coding environment uses Unicode, where 'á' has the value 225 rather than the 160 that the console expects.
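
When diagnosing this kind of mismatch, it helps to look at the actual byte values the program holds instead of at the rendered glyphs. Below is a minimal sketch; format_bytes is a hypothetical helper, not a standard function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: write the bytes of s into buf as space-separated
   uppercase hex (e.g. "C3 A1"). Returns the number of characters written,
   not counting the terminator; output stops early if buf is too small. */
size_t format_bytes(const char *s, char *buf, size_t bufsize) {
    assert(s != NULL && buf != NULL && bufsize > 0);
    size_t used = 0;
    buf[0] = '\0';
    for (const unsigned char *p = (const unsigned char *)s; *p != 0; p++) {
        int n = snprintf(buf + used, bufsize - used,
                         used ? " %02X" : "%02X", (unsigned)*p);
        if (n < 0 || (size_t)n >= bufsize - used) {
            buf[used] = '\0';  /* drop the partially written byte */
            break;
        }
        used += (size_t)n;
    }
    return used;
}
```

A string holding 'á' shows up as "E1" if the data is Latin-1, as "C3 A1" if it is UTF-8, and as "A0" if it is in code page 437, which tells you which conversion the output path needs.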
