How does C uppercase letters?

Question

I see this code in glibc-2.33/ctype/ctype.c :

// [...]

#define __ctype_toupper \
  ((int32_t *) _NL_CURRENT (LC_CTYPE, _NL_CTYPE_TOUPPER) + 128)

// [...]

int
toupper (int c)
{
  return c >= -128 && c < 256 ? __ctype_toupper[c] : c;
}
libc_hidden_def (toupper)

I understand that it's checking if c is within -128 and 256 (inclusive) and returns the character as-is if it's outside that range, but what does _NL_CURRENT (LC_CTYPE, _NL_CTYPE_TOUPPER) + 128) mean and where do I actually find the source code of how letters are uppercased? This seems to be looking up the current locale, I am only interested in en_US.UTF-8 . Also, how can a character be negative?

I don't care about glibc specifically, I just want to know how all the ASCII characters (all as in from NUL to DEL) are uppercased in C.

Answer 1

"C" doesn't convert characters to upper case. The C standard only mandates that there be a function in the standard library which does so correctly according to the current locale, and that it does so in a particular way in the "C" locale (which is the only locale which is guaranteed to exist).

Library implementations are free to accomplish that task as the implementers see fit, and they all do it in different ways. Even radically different ways. Some C libraries don't support locales other than the "C" locale with an ASCII character set. An example of such a C library is musl and it is hard to beat the simplicity of its implementation:

int toupper(int c)
{
        if (islower(c)) return c & 0x5f;
        return c;
}

As you can see, the above code depends on islower . Here it is:

int islower(int c)
{
        return (unsigned)c-'a' < 26;
}

Because of the call to islower , toupper returns unchanged any argument outside of the range of lower case characters, even arguments not in the valid range for toupper. Since the standard doesn't define the behaviour of toupper for arguments outside of the valid range (essentially values which might be returned by fgetc ), just returning invalid arguments unchanged is certainly as acceptable as any other behaviour. Glibc's toupper function will often segfault on invalid arguments, since it uses the argument as an index into an array (as you can see in the code you cite). That behaviour is also acceptable according to the standard.

The Glibc implementation is a lot more complicated. And behind the scenes it depends on the locale data which is compiled from locale definition files, a process which is completely outside of the C standard and somewhat defined by the Posix standard (although the GNU implementation diverges in some way from Posix).

But here's the scoop: If you're using single byte characters in a UTF-8 locale, none of glibc's complicated code makes the slightest difference. The musl implementation works precisely as required in a UTF-8 locale, because the only alphabetic characters representable in a single byte UTF-8 representation are the 52 characters in the "Roman" alphabet. All the other Unicode characters are only representable in wide characters and multibyte sequences.

Furthermore, environments which use a single-byte encoding other than UTF-8 are increasingly rare. There are certainly a lot of us who had to learn this stuff because our programs ran on a variety of platforms which used different ISO-8859-x code pages . Or different single-byte Windows codepages. But in the end, Unicode won out. (And many of us breathed huge sighs of relief.) So most of this apparatus is no longer really necessary except in legacy environments.

But that's not to say that Unicode magically solves all the complications involved in managing the huge variety of alphabets in use in the world. Far from it. What Unicode does do is two-fold: it clarifies what the complications are (most of which is not captured by C/Posix locales), and it provides some basic standards for implementations.

And, as a side effect, UTF-8 standardises single-byte codes to basically conform with the original ASCII 7-bit standard. So if you're only dealing with 7-bit characters (which, these days, is probably less than ideal), you don't need anything beyond musl-style implementations. And if you are dealing with "all the world's character sets", you'll be looking for a library which actually conforms to Unicode, and which uses something other than char to represent characters.

But one complication is going to remain forever, sadly: the fact that C does not standardise the signedness of char . On platforms on which char is signed (Unix X86 and Windows, for two major examples), (char)0xA0 is (a) unspecified and (b) probably -96, which is what a single-byte 0xA0 represents in 2's complement. So if you write code which uses the various functions in ctype.h and don't take care of negative char values, and then you try to use that code with a UTF-8 encoded string which includes characters outside of the single-byte domain, then you will end up passing negative numbers to functions which might not be expecting them.

Answer 2

If you go back at the root and look for _NL_CTYPE_TOUPPER you will find a commit where it is written

[..] (ctype_output): Support for alternate locale format: Computation of nelems changes. _NL_CTYPE_TOUPPER32 [...]

So basically _NL_CTYPE_TOUPPER is the macro for _NL_CTYPE_TOUPPER(8bits) as for example in French you have À as uppercase version of à

Following this link you will find the header file langinfo.h that has this enum starting at line 43 and with _NL_CTYPE_TOUPPER defined at line 259.

LC_CTYPE category: character classification. 256 This information is accessed by the functions in <ctype.h>.

LC_CTYPE is defined for each language, see for example for French :

fr_FR:2000"

Note that it doesn't make a lot of sense to call this function since characters with accent are not contained in the ASCII table, but since this function is the one handling both utf8 and ascii that's how it works.

How does C uppercase letters?

Question

2 answers

solution1
4 ACCPTED 2021-04-07 03:54:18

solution2
0 2021-04-07 02:59:00

How does C uppercase letters?

Question

2 answers

solution1 4 ACCPTED 2021-04-07 03:54:18

solution2 0 2021-04-07 02:59:00

solution1
4 ACCPTED 2021-04-07 03:54:18

solution2
0 2021-04-07 02:59:00