
Why are multiple if statements faster than executing a while loop?

My program's input is a large string, around 30,000 characters. Below is the code for my own strlen:

size_t  strlen(const char *c)
{
    int i;

    i = 0;
    while (c[i] != '\0')
        i++;
    return (i);
}

The version of strlen above takes ~2.1 seconds to execute. With a different version, I was able to achieve ~1.4 seconds.

My question is, why are multiple if statements faster than executing a while loop?

size_t  strlen(const char *str)
{
    const char  *start;

    start = str;
    while (1)
    {
        if (str[0] == '\0')
            return (str - start);
        if (str[1] == '\0')
            return (str - start + 1);
        if (str[2] == '\0')
            return (str - start + 2);
        if (str[3] == '\0')
            return (str - start + 3);
        if (str[4] == '\0')
            return (str - start + 4);
        if (str[5] == '\0')
            return (str - start + 5);
        if (str[6] == '\0')
            return (str - start + 6);
        if (str[7] == '\0')
            return (str - start + 7);
        if (str[8] == '\0')
            return (str - start + 8);
        str += 9;
    }
}

In other words, why is that many if statements faster than simply running a loop?

Edit: With the standard library, it takes around 1.25 seconds.

Your question is pertinent, but your benchmark is incomplete and has surprising results.

Here is a modified and instrumented version of your code:

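#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <fcntl.h>
#include <sys/stat.h>   /* S_IREAD, S_IWRITE */
#include <unistd.h>

#define VERSION     3   /* 1: naive, 2: unrolled, 3: 8 bytes at a time, other: C library strlen */
#define TRIALS      100
#define ITERATIONS  100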

#if VERSION == 1

size_t strlen1(const char *c) {
    size_t i;

    i = 0;
    while (c[i] != '\0')
        i++;
    return (i);
}
#define strlen(s)  strlen1(s)

#elif VERSION == 2

size_t strlen2(const char *str) {
    const char  *start;

    start = str;
    while (1) {
        if (str[0] == '\0')
            return (str - start);
        if (str[1] == '\0')
            return (str - start + 1);
        if (str[2] == '\0')
            return (str - start + 2);
        if (str[3] == '\0')
            return (str - start + 3);
        if (str[4] == '\0')
            return (str - start + 4);
        if (str[5] == '\0')
            return (str - start + 5);
        if (str[6] == '\0')
            return (str - start + 6);
        if (str[7] == '\0')
            return (str - start + 7);
        if (str[8] == '\0')
            return (str - start + 8);
        str += 9;
    }
}
#define strlen(s)  strlen2(s)

#elif VERSION == 3

size_t strlen3(const char *str) {
    const uint64_t *px, sub = 0x0101010101010101, mask = 0x8080808080808080;
    const char *p;

    for (p = str; (uintptr_t)p & 7; p++) {
        if (!*p)
            return p - str;
    }
    for (px = (const uint64_t *)(uintptr_t)p;;) {
        uint64_t x = *px++;
        if (((x - sub) & ~x) & mask)
            break;
    }
    for (p = (const char *)(px - 1); *p; p++)
        continue;
    return p - str;
}
#define strlen(s)  strlen3(s)

#endif

int get_next_line(int fd, char **pp) {
    char buf[32768];
    char *line = NULL, *new_line;
    char *p;
    ssize_t line_size = 0;
    ssize_t nread, chunk;

    while ((nread = read(fd, buf, sizeof buf)) > 0) {
        p = memchr(buf, '\n', nread);
        chunk = (p == NULL) ? nread : p - buf;
        new_line = realloc(line, line_size + chunk + 1);
        if (!new_line) {
            free(line);
            *pp = NULL;
            return 0;
        }
        line = new_line;
        memcpy(line + line_size, buf, chunk);
        line_size += chunk;
        line[line_size] = '\0';
        if (p != NULL) {
            lseek(fd, chunk + 1 - nread, SEEK_CUR);
            break;
        }
    }
    *pp = line;
    return line != NULL;
}

int main(void) {
    char *line = NULL;
    int fd, fd2, count, trial;
    clock_t min_clock = 0;

    fd = open("one_big_fat_line.txt", O_RDONLY);
    if (fd < 0) {
        printf("cannot open one_big_fat_line.txt\n");
        return 1;
    }

    fd2 = open("output.txt", O_WRONLY | O_CREAT | O_TRUNC, S_IREAD | S_IWRITE);
    if (fd2 < 0) {
        printf("cannot open output.txt\n");
        return 1;
    }

    for (trial = 0; trial < TRIALS; trial++) {
        clock_t t = clock();
        for (count = 0; count < ITERATIONS; count++) {
            lseek(fd, 0L, SEEK_SET);
            lseek(fd2, 0L, SEEK_SET);
            while (get_next_line(fd, &line) == 1) {
                write(fd2, line, strlen(line));
                write(fd2, "\n", 1);
                free(line);
            }
        }
        t = clock() - t;
        if (min_clock == 0 || min_clock > t)
            min_clock = t;
    }
    close(fd);
    close(fd2);

    double time_taken = (double)min_clock / CLOCKS_PER_SEC;
    printf("Version %d time: %.3f microseconds\n", VERSION, time_taken * 1000000 / ITERATIONS);
    return 0;
}

The program opens a file and reads lines from it with a custom function get_next_line() that uses Unix system calls and realloc to return arbitrarily sized lines. It then writes each line with the Unix system call write and appends a newline with a separate system call.

Benchmarking this sequence with your test file, a 30000 byte file with a single line of ASCII characters, shows very different performance from what you measure: depending on the selected implementation of strlen and the compilation optimisation settings, the time on my laptop ranges from 15 microseconds to 82 microseconds per iteration, nowhere close to the 1 or 2 seconds you observe.

  • Using the C library default implementation, I get 14.5 microseconds per iteration with or without optimisations.

  • Using your strlen1 naive implementation, I get 82 microseconds with optimisations disabled and 25 microseconds with -O3 optimisations.

  • Using your strlen2 unrolled implementation, the time improves to 30 microseconds with -O0 and 20 microseconds with -O3.

  • Finally, a more advanced C implementation, strlen3, which reads 8 bytes at a time, improves performance further: 21 microseconds with -O0 and 15.5 microseconds with -O3.

Note how compiler optimisations affect the performance much more than manual optimisations.

Your unrolled version performs better because the code generated for the naive loop performs an increment and an unconditional jump once per byte, whereas the unrolled version reduces these to once every 9 bytes. Note however that with -O3 the C compiler gets almost the same performance from the naive code as you get by unrolling the loop yourself.

The advanced version is very close in performance to the C library implementation, which may use assembly language with SIMD instructions. It reads 8 bytes at a time and performs an arithmetic trick to detect whether any of these bytes has its topmost bit changed from 0 to 1 when 1 is subtracted from its value, which only happens for a null byte. The extra initial steps are required to align the pointer before reading 64-bit words, thus avoiding unaligned reads, which have undefined behavior on some architectures. It also assumes that memory protection is not available at the byte level, because the last word read may extend up to 7 bytes past the null terminator. On modern x86 systems, memory protection has a 4K or larger granularity, but on some other systems such as Windows 2.x the protection was much finer grained, preventing this optimisation altogether.

Note however that the benchmark also measures the time to read from the input file, locate the newline and write to the output file. The relative performance difference between strlen and strlen3 alone is probably much more significant. Indeed, a separate benchmark for just strlen(line) with your 30000 byte line shows a time of 2.2 microseconds for strlen3() and 0.85 microseconds for strlen().

Conclusions:

  • benchmarking is a tricky game.
  • compilers are good at optimizing when told to do so; -O3 is a good default.
  • redefining library functions to try and optimise them is futile and risky.

