
strlen performance implementation

This is a multipurpose question:

  • How does this compare to the glibc strlen implementation?
  • Is there a better way to do this, in general and for auto-vectorization?

#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <limits.h>
#include <stdbool.h>

/* 0x0101...01 — the lowest bit of every byte set */
#define WORD_ONES_LOW   ((size_t)-1 / UCHAR_MAX)
/* 0x8080...80 — the highest bit of every byte set */
#define WORD_ONES_HIGH  (((size_t)-1 / UCHAR_MAX) << (CHAR_BIT - 1))

/*@doc
 * @desc: see if an arch word has a zero
 * @param: w - pointer to a word-aligned chunk of the string
 */
static inline bool word_has_zero(const size_t *w)
{
    return ((*w - WORD_ONES_LOW) & ~*w & WORD_ONES_HIGH);
}

/*@doc
 * @desc: see POSIX strlen()
 * @param: s - string
 */
size_t strlen(const char *s)
{
    const char *z = s;

    /* Align to word size */
    for (; ((uintptr_t)s & (sizeof(size_t) - 1)) && *s != '\0'; s++);

    if (*s != '\0') {
        const size_t *w;

        for (w = (const size_t *)s; !word_has_zero(w); w++);
        for (s = (const char *)w; *s != '\0'; s++);
    }

    return (s - z);
}

Well, this implementation is based on virtually the same trick (Determine if a word has a zero byte) as the glibc implementation you linked. They do pretty much the same thing, except that in the glibc version some loops are unrolled and the bit masks are spelled out explicitly. The WORD_ONES_LOW and WORD_ONES_HIGH from the code you posted are exactly lomagic = 0x01010101L and himagic = 0x80808080L from the glibc version.

The only difference I see is that the glibc version uses a slightly different criterion for detecting a zero byte:

if ((longword - lomagic) & himagic)

without doing ... & ~longword (compare to the word_has_zero() function in your example, which does the same thing but also includes the ~*w term). Apparently the glibc authors believed this shorter formula is more efficient. Yet it can result in false positives, so they check for false positives under that if.

It is indeed an interesting question which is more efficient: a single-stage precise test (your code), or a two-stage test that begins with a rough imprecise check followed, if necessary, by a precise second check (the glibc code).

If you want to see how they compare in terms of actual performance, time them on your platform and your data. There's no other way.

Also, please note this implementation can read past the end of a char array here:

for (w = (const size_t *)s; !word_has_zero(w); w++);

and therefore relies on undefined behaviour.

To answer your second question, I think the naive byte-based strlen implementation will result in better auto-vectorization by the compiler, if it is smart and support for vector instruction set extensions (e.g. SSE) has been enabled (e.g. with -msse or an appropriate -march). Unfortunately, it won't result in any vectorization for baseline CPUs which lack these features, even though the compiler could generate 32- or 64-bit pseudo-vectorized code like the C code cited in the question, if it were smart enough.
