
C difference between *(unsigned char *)s1 and (unsigned char)*s1

I have an assignment to re-write some popular C functions that are available in libc.

I was writing strcmp, and when I was done and happy with it I went to check the one in libc.

This is mine :

int     ft_strcmp(const char *s1, const char *s2)
{
    while (*s1 && *s1 == *s2)
    {
        s1++;
        s2++;
    }
    return ((unsigned char)*s1 - (unsigned char)*s2);
}

And this is the one in the libc ( https://www.opensource.apple.com/source/Libc/Libc-262/ppc/gen/strcmp.c ) :

int
strcmp(const char *s1, const char *s2)
{
    for ( ; *s1 == *s2; s1++, s2++)
        if (*s1 == '\0')
            return 0;
    return ((*(unsigned char *)s1 < *(unsigned char *)s2) ? -1 : +1); // HERE! Why *(unsigned char *) :/ ?
}

I don't understand why that *(unsigned char *)s1 works; I thought it wouldn't, but it really seems to!

And then I found this implementation in another libc ( https://sourceware.org/git/?p=glibc.git;a=blob;f=string/strcmp.c;h=a4645638eb685e479b89a5e3912076329cc27773;hb=HEAD )

int
strcmp (p1, p2)
     const char *p1;
     const char *p2;
{
  const unsigned char *s1 = (const unsigned char *) p1;
  const unsigned char *s2 = (const unsigned char *) p2;
  unsigned char c1, c2;

  do
    {
      c1 = (unsigned char) *s1++;
      c2 = (unsigned char) *s2++;
      if (c1 == '\0')
        return c1 - c2;
    }
  while (c1 == c2);

  return c1 - c2;
}

This one is also weird, but for other reasons, and it uses what I thought was right: (const unsigned char *) p1

(unsigned char *)s1 typecasts s1 from a const char * to an unsigned char *; *(unsigned char *)s1 then dereferences it to get the value.

With (unsigned char)*s1, you take a char *, dereference it to a char, and then cast that value to an unsigned char.

The one you didn't think would work simply casts the pointer to an unsigned char * first, so when it dereferences, the result is an unsigned char.

In this case, because it's just going from char to unsigned char, there's basically no difference.

If, however, the original pointer were to an int or something, yours would read the whole int and cast its value to unsigned char, while the other one would read only the first byte of the int and return it as an unsigned char.
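
For illustration, here is a minimal sketch (not from the original answer) of that int difference; the byte read by the pointer cast depends on endianness, so the values in the comments are assumptions for the named byte orders:

#include <stdio.h>

int main(void)
{
    int n = 0x12345678;
    int *p = &n;

    /* Cast the value: the whole int is read, then its value is
       converted (truncated modulo 256) to unsigned char. */
    unsigned char a = (unsigned char)*p;   /* always 0x78 */

    /* Cast the pointer: only the first byte of the int is read. */
    unsigned char b = *(unsigned char *)p; /* 0x78 on little-endian,
                                              0x12 on big-endian */

    printf("value cast: %#x, pointer cast: %#x\n", (unsigned)a, (unsigned)b);
    return 0;
}

On a little-endian machine both print 0x78, which is why the two spellings look interchangeable there; a big-endian machine would expose the difference.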

The difference between (unsigned char)*s1 and *(unsigned char*)s1 is in how data is loaded from the position that s1 points to:

  • (unsigned char)*s1 reads a value of the type s1 points to, then converts that value to an unsigned char. This variant cannot invoke undefined behaviour.

    If s1 were a double*, a double would be read (that is, 8 bytes would be loaded from memory), and its value would be converted to an unsigned char.

  • *(unsigned char*)s1 first changes what the pointer is supposed to point to, then reads the first byte at the location s1 points at. Under certain conditions this is undefined behaviour with the newer standards; your case does not invoke undefined behaviour, though.

    If s1 were a double* again, the resulting code would load the bit pattern in the first byte in which the double is stored (that is, only one byte would be loaded). This would be something entirely different from the logical value of the double (a sketch follows this list).
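
As a concrete sketch of that double case (not from the original answer; the byte values in the comments assume IEEE 754 doubles):

#include <stdio.h>

int main(void)
{
    double d = 1.0;
    double *s1 = &d;

    /* Reads all 8 bytes of the double, then converts the value 1.0. */
    unsigned char v = (unsigned char)*s1;   /* 1 */

    /* Reads only the first byte of the double's object representation. */
    unsigned char b = *(unsigned char *)s1; /* 0x00 on little-endian,
                                               0x3f on big-endian
                                               (IEEE 754: 1.0 is
                                               0x3ff0000000000000) */

    printf("value: %u, first byte: %#x\n", (unsigned)v, (unsigned)b);
    return 0;
}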


Aside

Concerning the possibility of undefined behaviour, the rules are roughly as follows:

  • Casting a pointer to something that is "close enough" is fine. This includes casts changing constness and signedness.

  • Casts to char* types are a special case; they never invoke undefined behaviour. (Thanks to Jens Gustedt for pointing this out.)

So we have the following cases:

  • Casting an int* to a const unsigned int* is fine.

  • Casting an int* to a char* is fine.

  • Casting a double* to a uint64_t* to analyse the bit pattern of the double is undefined behaviour, and allows your compiler to insert code that formats your hard drive. (A sketch contrasting this with the well-defined char* route follows this list.)
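
Here is a small, self-contained sketch (an illustration, not from the original answer) contrasting the well-defined unsigned char* inspection with a memcpy-based alternative to the undefined uint64_t* cast; it assumes an 8-byte double:

#include <inttypes.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    double d = 1.0;

    /* Well defined: char-typed pointers may alias any object,
       so inspecting the double byte by byte is fine. */
    unsigned char *bytes = (unsigned char *)&d;
    for (size_t i = 0; i < sizeof d; i++)
        printf("%02x ", bytes[i]);
    putchar('\n');

    /* Undefined behaviour: uint64_t * is not "close enough" and
       does not get the char-type exception. */
    /* uint64_t bad = *(uint64_t *)&d; */

    /* A well-defined way to obtain the same bit pattern (assuming
       sizeof(double) == sizeof(uint64_t)): copy the bytes instead. */
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    printf("%016" PRIx64 "\n", bits);

    return 0;
}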

Short answer: char and unsigned char are similar enough that they can be interpreted in the same way.

Long answer: The C standard guarantees that both char and unsigned char have a size of 1 byte, and store their "value bits" in the same format. So for values up to 127, the behavior of this function is strictly defined.

It only gets messy when you get to the sign bit. The C standard allows signed integers to be represented as one's complement, two's complement, or signed magnitude, depending on the implementation. So on a platform using two's complement (which is by far the most common), -1 would be represented as 11111111, and would equal 255 when interpreted as an unsigned char. But using signed magnitude, it would be represented as 10000001 and equal 129 when interpreted as an unsigned char.

In the latter case, this is DIFFERENT from what you would get by explicitly casting to unsigned char (the (unsigned char) *s1++ example):

if the new type is unsigned, the value is converted by repeatedly adding or subtracting one more than the maximum value that can be represented in the new type until the value is in the range of the new type.

So the C standard guarantees that if you explicitly cast -1 to an unsigned char, the value 256 (one more than UCHAR_MAX, for 8-bit chars) will be added, making the result of the cast 255. So if you were on a platform using signed magnitude:

    char c = -1;
    unsigned char u1 = (unsigned char)c; // this results in 255
    unsigned char u2 = *(unsigned char *)&c; // this results in 129!

I imagine these discrepancies are so uncommon that no one notices them, though. C implementations that don't use two's complement are few and far between.
