简体   繁体   中英

qsort function compare confused me

I see lots of people use subtraction in a qsort comparator function. I think it is wrong because when dealing with these numbers: int nums[]={-2147483648,1,2,3}; INT_MIN = -2147483648; int nums[]={-2147483648,1,2,3}; INT_MIN = -2147483648;

int compare (const void * a, const void * b)
{
  return ( *(int*)a - *(int*)b );
}

I wrote this function to test:

#include <stdio.h>
#include <limits.h>

int compare (const void * a, const void * b)
{
    return ( *(int*)a - *(int*)b );
}

int main(void)
{
    int a = 1;
    int b = INT_MIN;
    printf("%d %d\n", a,b);
    printf("%d\n",compare((void *)&a,(void *)&b));
    return 0;
}

The output is:

1 -2147483648
-2147483647

but a > b so the output should be positive。 I have seen many books write like this. I think it is wrong; it should be written like this when dealing with int types:

int compare (const void * a, const void * b)
{
    if(*(int *)a < *(int *)b)
        return -1;
    else if(*(int *)a > *(int *)b)
        return 1;
    else 
        return 0;
}

I just cannot figure out why many books and web sites write in such a misleading way. If you have any different view, please let me know.

I think it is wrong

Yes, a simple subtraction can lead to int overflow which is undefined behavior and should be avoided.

return *(int*)a - *(int*)b;  // Potential undefined behavior.

A common idiom is to subtract two integer compares. Various compilers recognize this and create efficient well behaved code. Preserving const -ness also is good form.

const int *ca = a;
const int *cb = b;
return (*ca > *cb) - (*ca < *cb);

why many books and web sites write in such a misleading way.

return *a - *b; is conceptually easy to digest - even if it provides the wrong answer with extreme values - often learner code omits edge conditions to get the idea across - "knowing" that values will never be large .

Or consider the complexities of comparing long doubles with regard to NaN .

Your understanding is absolutely correct. This common idiom cannot be used for int values.

Your proposed solution works correctly, although it would be more readable with local variables to avoid so many casts:

int compare(const void *a, const void *b) {
    const int *aa = a;
    const int *bb = b;
    if (*aa < *bb)
        return -1;
    else if (*aa > *bb)
        return 1;
    else 
        return 0;
}

Note that modern compilers will generate the same code with or without these local variables: always prefer the more readable form.

A more compact solution with the same exact result is commonly used although a bit more difficult to understand:

int compare(const void *a, const void *b) {
    const int *aa = a;
    const int *bb = b;
    return (*aa > *bb) - (*aa < *bb);
}

Note that this approach works for all numeric types, but will return 0 for NaN floating point values.

As for your remark: I just cannot figure out why many books and web sites write in such a misleading way :

  • Many books and websites contain mistakes, and so do most programs. Many programming bugs get caught and squashed before they reach production if the program is tested wisely. Code fragments in books are not tested, and although they never reach production , the bugs they contain do propagate virally via unsuspecting readers who learn bogus methods and idioms. A very bad and lasting side effect.

  • Kudos to you for catching this! You have a rare skill among programmers: you are a good reader. There are far more programmers who write code than programmers who can read code correctly and see mistakes. Hone this skill by reading other people's code, on stack overflow or from open source projects... And do report the bugs.

  • The subtraction method is in common use, I have seen it in many places like you and it does work for most value pairs. This bug may go unnoticed for eons. A similar problem was latent in the zlib for decades: int m = (a + b) / 2; causes a fateful integer overflow for large int values of a and b .

  • The author probably saw it used and thought the subtraction was cool and fast, worth showing in print.

  • Note however that the erroneous function does work correctly for types smaller than int : signed or unsigned char and short , if these types are indeed smaller than int on the target platform, which the C Standard does not mandate.

  • Indeed similar code can be found in The C Programming Language by Brian Kernighan and Dennis Ritchie, the famous K&R C bible by its inventors. They use this approach in a simplistic implementation of strcmp() in chapter 5. The code in the book is dated, going all the way back to the late seventies. Although it has implementation defined behavior, it does not invoke undefined behavior in any but the rarest architectures among which the infamous DeathStation-9000 , yet it should not be used to compare int values.

You are correct, *(int*)a - *(int*)b poses a risk of integer overflow and ought to be avoided as a method of comparing two int values.

It is possible it could be valid code in a controlled situation where one knows the values are such that the subtraction will not overflow. In general, though, it should be avoided.

The reason why so many books are wrong is likely the root of all evil: the K&R book. In chapter 5.5 they try to teach how to implement strcmp :

int strcmp(char *s, char *t)
{
  int i;
  for (i = 0; s[i] == t[i]; i++)
    if (s[i] == '\0')
      return 0;
  return s[i] - t[i];
}

This code is questionable since char has implementation-defined signedness. Ignoring that, and ignoring that they fail to use const correctness as in the standard C version, the code otherwise works, partially because it relies on implicit type promotion to int (which is ugly), partially since they assume 7 bit ASCII, and the worst case 0 - 127 cannot underflow.

Further down in the book, 5.11, they try to teach how to use qsort :

qsort((void**) lineptr, 0, nlines-1,
  (int (*)(void*,void*))(numeric ? numcmp : strcmp));

Ignoring the fact that this code invokes undefined behavior, since strcmp is not compatible with the function pointer int (*)(void*, void*) , they teach to use the above method from strcmp .

However, looking at their numcmp function, it looks like this:

/* numcmp: compare s1 and s2 numerically */
int numcmp(char *s1, char *s2)
{
  double v1, v2;
  v1 = atof(s1);
  v2 = atof(s2);
  if (v1 < v2)
    return -1;
  else if (v1 > v2)
    return 1;
  else
    return 0;
}

Ignoring the fact that this code will crash and burn if an invalid character is found by atof (such as the very likely locale issue with . versus , ), they actually manage to teach the correct method of writing such a comparison function. Since this function uses floating point, there's really no other way to write it.

Now someone might want to come up with an int version of this. If they do it based on the strcmp implementation rather than the floating point implementation, they'll get bugs.

Overall, just by flipping a few pages in this once canonical book, we already found some 3-4 cases of reliance on undefined behavior and 1 case of reliance on implementation-defined behavior. So it is really no wonder if people who learn C from this book writes code full of undefined behavior.

First, it's of course correct that an integer during the comparison could create serious problems for you.

On the other hand, doing a single subtraction is cheaper than going through an if/then/else, and the comparison gets performed O(n^2) times in a quicksort, so if this sort is performance-critical and we can get away with it we may want to use the difference.

It will work fine so long as all the values are in some range of size less than 2^31, because then their differences have to be smaller. So if whatever is generating the list you want to sort is going to keep values between a billion and minus one billion then you're fine using subtraction.

Note that checking that the values are in such a range prior to the sort is an O(n) operation.

On the other hand if there's a chance that the overflow could happen, you'd want to use something like the code you wrote in your question

Note that lots of stuff you see doesn't explicitly take overflow into account; it's just that maybe that's more expected in something that's more obviously an "arithmetic" context.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM