Subtracting char* in the implementation of strlen()

Question

I am studying the implementation of the strlen() function in C. I need to understand how it works for an assignment of mine.

#define ALIGN (sizeof(size_t))
#define ONES ((size_t)-1/UCHAR_MAX)
#define HIGHS (ONES * (UCHAR_MAX/2+1))
#define HASZERO(x) ((x)-ONES & ~(x) & HIGHS)

size_t strlen(const char *s)
{
    const char *a = s;
    const size_t *w;
    for (; (uintptr_t)s % ALIGN; s++) if (!*s) return s-a;
    for (w = (const void *)s; !HASZERO(*w); w++);
    for (s = (const void *)w; *s; s++);
    return s-a;
}

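I don't understand what the char* subtraction in the "return s-a" statements is doing here.

This is musl's strlen implementation. glibc's strlen() implementation also uses this char* subtraction.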

Answer 1

The code, explained with comments:

size_t strlen(const char *s)
{
    const char *a = s;      // store a copy pointing at the start of the original        
    const size_t *w;
    for (; (uintptr_t)s % ALIGN; s++) // in case of misalignment, look for first aligned address
      if (!*s) return s-a; // if we encounter \0 while doing so, return the string length
    for (w = (const void *)s; !HASZERO(*w); w++); // work with word-sized chunks and do lookup
    for (s = (const void *)w; *s; s++); // find the exact location of \0 in the final word
    return s-a; // end minus beginning = length
}
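
The word-sized lookup in the second loop relies on the HASZERO bit trick. As a minimal sketch of why it works (the helpers word_has_zero and demo_has_zero are hypothetical names introduced only for illustration): subtracting ONES borrows into the high bit of any byte that was zero, and the "& ~(x)" term discards bytes whose high bit was already set, so a non-zero result means at least one byte in the word is zero.

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>

#define ONES  ((size_t)-1 / UCHAR_MAX)      /* 0x0101...01: a 1 in every byte   */
#define HIGHS (ONES * (UCHAR_MAX / 2 + 1))  /* 0x8080...80: high bit of each byte */
#define HASZERO(x) (((x) - ONES) & ~(x) & HIGHS)

/* Hypothetical helper: assemble a word from sizeof(size_t) bytes and
   report whether HASZERO flags it. memcpy is used so that no size_t
   pointer ever aliases the byte array. */
static int word_has_zero(const unsigned char *bytes)
{
    size_t w;
    memcpy(&w, bytes, sizeof w);
    return HASZERO(w) != 0;
}

/* Hypothetical demo: fill a word with one byte value and, if zero_index
   is a valid position, overwrite that byte with 0 before testing. */
static int demo_has_zero(unsigned char fill, int zero_index)
{
    unsigned char bytes[sizeof(size_t)];
    memset(bytes, fill, sizeof bytes);
    if (zero_index >= 0 && (size_t)zero_index < sizeof bytes)
        bytes[zero_index] = 0;
    return word_has_zero(bytes);
}
```

For instance, demo_has_zero('A', -1) yields 0 because no byte is zero, while demo_has_zero('A', 0) yields 1. The macro only says that a zero byte exists somewhere in the word, which is why the third loop still scans byte by byte to find its exact position.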

Notes on C language compliance:

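w = (const void *)s relies on non-standard extensions, and *w invokes undefined behavior (a strict aliasing violation). This is library code, so it may be compiled with specific settings (for example -fno-strict-aliasing).
s-a is actually of type ptrdiff_t, not size_t. A cast may therefore be needed to silence compiler warnings.
size_t is not necessarily the largest aligned type of the implementation; a larger one may exist. I believe the most correct type to use for 32 bits and beyond is uint_fast32_t. The compiler/library should make this type 32 or 64 bits, depending on which is actually fastest on a 32/64-bit CPU.
Library implementations like this one sometimes read word-sized chunks beyond the end of the passed string. This assumes that if the string does not end on an aligned address, harmless padding bytes exist and are accessible there. The C standard by no means guarantees this (doing so is an out-of-bounds array access, which is UB), but the local implementation may guarantee it.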
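
Two of the notes above can be illustrated with a small sketch. It is not what musl or glibc do: load_word and padded_strlen are hypothetical helpers, and the sketch assumes the caller passes a buffer whose size is a multiple of sizeof(size_t) and which contains a terminating '\0', so every whole-word read stays in bounds. The word is loaded with memcpy instead of dereferencing a size_t pointer, which sidesteps the aliasing problem, and the ptrdiff_t result of the subtraction is cast explicitly.

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>

#define ONES  ((size_t)-1 / UCHAR_MAX)
#define HIGHS (ONES * (UCHAR_MAX / 2 + 1))
#define HASZERO(x) (((x) - ONES) & ~(x) & HIGHS)

/* Hypothetical helper: copy sizeof(size_t) bytes into a real size_t
   object, so no char array is ever accessed through a size_t lvalue. */
static size_t load_word(const char *p)
{
    size_t w;
    memcpy(&w, p, sizeof w);
    return w;
}

/* Hypothetical variant. ASSUMPTION: buf is padded so that its size is a
   multiple of sizeof(size_t) and it contains a '\0'; under that
   assumption no read goes past the end of the array. */
static size_t padded_strlen(const char *buf)
{
    const char *s = buf;

    while (!HASZERO(load_word(s)))   /* skip whole words without a zero byte */
        s += sizeof(size_t);
    while (*s != '\0')               /* locate the zero byte inside the word */
        s++;

    ptrdiff_t diff = s - buf;        /* pointer subtraction yields ptrdiff_t */
    return (size_t)diff;             /* explicit cast; diff is never negative */
}
```

With a declaration such as char buf[32] = "Hello world";, padded_strlen(buf) gives 11. Compilers commonly turn a fixed-size memcpy like this into a single load, but that is an optimization, not something the standard promises.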

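It should be possible to break this code up into something more readable and self-documenting without affecting performance, and to fix some of the issues above along the way. Perhaps something along the lines of (untested, not benchmarked):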

#include <stddef.h>
#include <stdint.h>
#include <limits.h>

#define ONES ((uint_fast32_t)-1/UCHAR_MAX)
#define HIGHS (ONES * (UCHAR_MAX/2+1))
#define HASZERO(x) ((x)-ONES & ~(x) & HIGHS)

size_t strlen (const char* s)
{
  const char* begin = s;
  const char* end   = s;

  for (; (uintptr_t)end % _Alignof(uint_fast32_t); end++)
  {
    if (*end == '\0') 
    {
      return (size_t)(end - begin);
    }
  }
  
  const uint_fast32_t* word;
  for (word = (const void*)end; !HASZERO(*word); word++)
  {}
  
  for (end = (const void*)word; *end != '\0'; end++)
  {}
  
  return (size_t)(end - begin);
}
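
The first loop in both versions walks forward one byte at a time until the address is a multiple of the word alignment. A minimal sketch of that arithmetic, using a hypothetical helper bytes_until_aligned and assuming that converting a pointer to uintptr_t yields the numeric address (true on mainstream platforms, but implementation-defined):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: how many single-byte steps the alignment loop
   would take before p reaches an address divisible by align.
   ASSUMPTION: (uintptr_t)p is the plain numeric address. */
static size_t bytes_until_aligned(const void *p, size_t align)
{
    size_t rem = (size_t)((uintptr_t)p % align);
    return rem ? align - rem : 0;
}
```

For a buffer that starts on an 8-byte boundary, bytes_until_aligned(buf + 1, 8) is 7 and bytes_until_aligned(buf + 8, 8) is 0, so at most align - 1 bytes are ever checked one by one before the word-sized loop takes over.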

Answer 2

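Suppose you have the string "Hello world". It is stored in memory as an array, terminated by the special "null" character ('\0').

The array looks like this:

+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| 'H' | 'e' | 'l' | 'l' | 'o' | ' ' | 'w' | 'o' | 'r' | 'l' | 'd' | '\0' |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+

When the function is called (as in strlen("Hello world")), s points to the first character in the array. The initialization of a makes it point to the first character of the array as well.

The three loops modify s so that it points to the terminating null character.

If we show the array again, now with the pointers, it looks like this:

+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| 'H' | 'e' | 'l' | 'l' | 'o' | ' ' | 'w' | 'o' | 'r' | 'l' | 'd' | '\0' |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
^                                                                 ^
|                                                                 |
a                                                                 s

What s - a does is compute the difference between the two pointers s and a (in array elements). That difference is 11, which is the length of the string (not counting the null terminator).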
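
The same idea can be reduced to a few lines. This is only an illustrative sketch with hypothetical function names: distance_to_nul mirrors the a and s pointers from the diagram, and int_distance shows that pointer subtraction counts elements of the pointed-to type rather than bytes.

```c
#include <stddef.h>

/* Hypothetical helper: advance s to the terminating '\0' and return
   how many elements it ended up away from the start pointer a. */
static ptrdiff_t distance_to_nul(const char *a)
{
    const char *s = a;
    while (*s != '\0')
        s++;
    return s - a;            /* end minus beginning, in chars */
}

/* Hypothetical helper: both pointers must point into the same array
   (or one past its end); the result counts ints, not bytes. */
static ptrdiff_t int_distance(const int *first, const int *last)
{
    return last - first;
}
```

Here distance_to_nul("Hello world") is 11, and for an array int v[5], int_distance(&v[1], &v[4]) is 3 regardless of sizeof(int). For char the element count and the byte count coincide, which is why s - a is directly the string length.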

Solution 1
2 2020-10-14 07:09:45

Solution 2
1 2020-10-14 05:17:54
