
Subtraction of char* in implementation of strlen()

I was looking at the implementation of strlen() function in C. I need to understand its working for one of my assignments.

#define ALIGN (sizeof(size_t))
#define ONES ((size_t)-1/UCHAR_MAX)
#define HIGHS (ONES * (UCHAR_MAX/2+1))
#define HASZERO(x) ((x)-ONES & ~(x) & HIGHS)

size_t strlen(const char *s)
{
    const char *a = s;
    const size_t *w;
    for (; (uintptr_t)s % ALIGN; s++) if (!*s) return s-a;
    for (w = (const void *)s; !HASZERO(*w); w++);
    for (s = (const void *)w; *s; s++);
    return s-a;
}

I do not understand what the subtraction of char* does here in the "return s-a" statements.

This is musl's strlen implementation. glibc's implementation of strlen() also uses this char* subtraction.

Explanation of the code annotated with comments:

size_t strlen(const char *s)
{
    const char *a = s;      // store a copy pointing at the start of the original        
    const size_t *w;
    for (; (uintptr_t)s % ALIGN; s++) // in case of misalignment, look for first aligned address
      if (!*s) return s-a; // if we encounter \0 while doing so, return the string length
    for (w = (const void *)s; !HASZERO(*w); w++); // work with word-sized chunks and do lookup
    for (s = (const void *)w; *s; s++); // find the exact location of \0 in the final word
    return s-a; // end minus beginning = length
}
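To see what the HASZERO step in the second loop does, here is a minimal sketch that applies the same bit trick to a single word. The helper name word_has_zero is hypothetical (it is not part of musl), the macro has extra parentheses but is otherwise the one quoted above, and the word is loaded with memcpy so the example itself does not depend on aliasing rules. It assumes 8-bit bytes and a size_t without padding bits.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

#define ONES  ((size_t)-1 / UCHAR_MAX)        /* 0x0101...01 */
#define HIGHS (ONES * (UCHAR_MAX / 2 + 1))    /* 0x8080...80 */
#define HASZERO(x) (((x) - ONES) & ~(x) & HIGHS)

/* Hypothetical helper: returns 1 if any of the sizeof(size_t) bytes
   starting at 'bytes' is zero, 0 otherwise. */
static int word_has_zero(const char *bytes)
{
    size_t w;

    assert(bytes != NULL);
    memcpy(&w, bytes, sizeof w);   /* aliasing-safe load of one word */

    /* Subtracting 1 from every byte sets a byte's top bit only if that byte
       was 0 (borrow) or was already >= 0x80; the "& ~(x)" term filters out
       the second case, so a nonzero result means a zero byte is present. */
    return HASZERO(w) != 0;
}
```

For a buffer of sizeof(size_t) bytes filled with 'x', word_has_zero returns 0; once any one of those bytes is set to '\0', it returns 1.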

Notes regarding C language compatibility:

  • w = (const void *)s relies on non-standard extensions, and dereferencing *w invokes undefined behavior (a strict-aliasing violation). This is library code, so it may be compiled with specific settings such as -fno-strict-aliasing.

  • s-a is actually of type ptrdiff_t, not size_t, so a cast might be required to silence compiler warnings.

  • size_t is not necessarily the largest aligned type for the implementation; the largest one could be wider than size_t. I believe the most correct type to use for 32-bit and above would be uint_fast32_t. The compiler/lib should make this type 32 or 64 bits depending on what's actually fastest on the 32/64-bit CPU.

  • Library implementations like this one sometimes read word-sized chunks beyond the end of the passed string. This assumes that if the string doesn't end on an aligned address, harmless padding bytes will be present and accessible there. This is by no means guaranteed by the C standard (doing so is an out-of-bounds array access, which is UB), but it may be guaranteed by the local implementation.
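The point about ptrdiff_t can be checked directly. The sketch below assumes a C11 compiler (for _Generic); the macro IS_PTRDIFF and the function span_length are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Evaluates to 1 if the expression has type ptrdiff_t, 0 otherwise.
   _Generic only inspects the type; the expression is not evaluated. */
#define IS_PTRDIFF(expr) _Generic((expr), ptrdiff_t: 1, default: 0)

/* Hypothetical helper: distance in elements between two pointers into the
   same array, converted explicitly to the unsigned type strlen returns. */
static size_t span_length(const char *begin, const char *end)
{
    assert(begin != NULL && end != NULL);
    assert(end >= begin);          /* both must point into the same array */
    return (size_t)(end - begin);  /* ptrdiff_t -> size_t, made explicit */
}
```

With const char text[] = "abc", IS_PTRDIFF(&text[3] - &text[0]) evaluates to 1, while IS_PTRDIFF(span_length(text, text + 3)) evaluates to 0 because the result has already been converted to size_t.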

It should be possible to uncrappify this code into something more readable & self-documenting, without affecting performance. And we can fix some of the above issues while we are at it. Perhaps something along the lines of (not tested/benchmarked):

#include <stddef.h>
#include <stdint.h>
#include <limits.h>

#define ONES ((uint_fast32_t)-1/UCHAR_MAX)
#define HIGHS (ONES * (UCHAR_MAX/2+1))
#define HASZERO(x) ((x)-ONES & ~(x) & HIGHS)

size_t strlen (const char* s)
{
  const char* begin = s;
  const char* end   = s;

  for (; (uintptr_t)end % _Alignof(uint_fast32_t); end++)
  {
    if (*end == '\0') 
    {
      return (size_t)(end - begin);
    }
  }
  
  const uint_fast32_t* word;
  for (word = (const void*)end; !HASZERO(*word); word++)
  {}
  
  for (end = (const void*)word; *end != '\0'; end++)
  {}
  
  return (size_t)(end - begin);
}

Let's say you have the string "Hello world". This string is stored as an array inside the memory of your computer, and terminated by a special "null" character ('\0').

The array will look something like this:

+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| 'H' | 'e' | 'l' | 'l' | 'o' | ' ' | 'w' | 'o' | 'r' | 'l' | 'd' | '\0' |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+

When this function is called (as in strlen("Hello world")), s will be pointing to the first character in the array. The initialization of a will also make it point to the first character of the array.

The three loops modify s so that it points to the terminating null character.

If we again show the array, but now with the pointers, it will be something like this:

+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| 'H' | 'e' | 'l' | 'l' | 'o' | ' ' | 'w' | 'o' | 'r' | 'l' | 'd' | '\0' |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
^                                                                 ^
|                                                                 |
a                                                                 s

What s - a does is calculate the difference (in array elements) between the two pointers s and a. This difference will be 11, which is the length of the string (null terminator not counted).
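The same end-minus-beginning idea can be sketched without any of the word-at-a-time optimization. The function name simple_strlen is hypothetical, chosen so it does not clash with the library's strlen; it is a byte-at-a-time illustration, not the musl or glibc code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical byte-at-a-time strlen: walk s forward to the terminator,
   then subtract the saved start pointer to get the element count. */
static size_t simple_strlen(const char *s)
{
    assert(s != NULL);

    const char *a = s;      /* remember where the string begins */
    while (*s != '\0')      /* advance until s points at the terminator */
        s++;

    return (size_t)(s - a); /* end minus beginning = number of chars */
}
```

Because both pointers are char*, the difference is counted in chars, so simple_strlen("Hello world") returns 11 and simple_strlen("") returns 0.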
