Why are multiple if statements faster than executing a while loop?
My program's input is a large string, around 30,000 characters. Below is the code for my own strlen:
size_t strlen(const char *c)
{
    int i;
    i = 0;
    while (c[i] != '\0')
        i++;
    return (i);
}
The version of strlen above takes ~2.1 seconds to execute. Through a different version, I was able to achieve ~1.4 seconds.
My question is, why are multiple if statements faster than executing a while loop?
size_t strlen(const char *str)
{
    const char *start;
    start = str;
    while (1)
    {
        if (str[0] == '\0')
            return (str - start);
        if (str[1] == '\0')
            return (str - start + 1);
        if (str[2] == '\0')
            return (str - start + 2);
        if (str[3] == '\0')
            return (str - start + 3);
        if (str[4] == '\0')
            return (str - start + 4);
        if (str[5] == '\0')
            return (str - start + 5);
        if (str[6] == '\0')
            return (str - start + 6);
        if (str[7] == '\0')
            return (str - start + 7);
        if (str[8] == '\0')
            return (str - start + 8);
        str += 9;
    }
}
My question is: why is a long run of if statements faster than simply running a loop?
Edit: With the standard library, it takes around 1.25 seconds.
Your question is pertinent, but your benchmark is incomplete and has surprising results.
Here is a modified and instrumented version of your code:
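#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <fcntl.h>
#include <sys/stat.h>   /* S_IREAD, S_IWRITE */
#include <unistd.h>
/* 1, 2 or 3 selects strlen1, strlen2 or strlen3; any other value uses the C library strlen */
#define VERSION 3
#define TRIALS 100
#define ITERATIONS 100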
#if VERSION == 1
/* Naive version: one comparison and one increment per byte. */
size_t strlen1(const char *c) {
    size_t i;
    i = 0;
    while (c[i] != '\0')
        i++;
    return (i);
}
#define strlen(s) strlen1(s)
#elif VERSION == 2
/* Manually unrolled version: the pointer advances once every 9 bytes. */
size_t strlen2(const char *str) {
    const char *start;
    start = str;
    while (1) {
        if (str[0] == '\0')
            return (str - start);
        if (str[1] == '\0')
            return (str - start + 1);
        if (str[2] == '\0')
            return (str - start + 2);
        if (str[3] == '\0')
            return (str - start + 3);
        if (str[4] == '\0')
            return (str - start + 4);
        if (str[5] == '\0')
            return (str - start + 5);
        if (str[6] == '\0')
            return (str - start + 6);
        if (str[7] == '\0')
            return (str - start + 7);
        if (str[8] == '\0')
            return (str - start + 8);
        str += 9;
    }
}
#define strlen(s) strlen2(s)
#elif VERSION == 3
/* Word-at-a-time version: scans 8 bytes per iteration once aligned. */
size_t strlen3(const char *str) {
    const uint64_t *px, sub = 0x0101010101010101, mask = 0x8080808080808080;
    const char *p;
    /* scan byte by byte until p is 8-byte aligned */
    for (p = str; (uintptr_t)p & 7; p++) {
        if (!*p)
            return p - str;
    }
    /* scan aligned 64-bit words until one contains a null byte */
    for (px = (const uint64_t *)(uintptr_t)p;;) {
        uint64_t x = *px++;
        if (((x - sub) & ~x) & mask)
            break;
    }
    /* locate the null byte inside the last word read */
    for (p = (const char *)(px - 1); *p; p++)
        continue;
    return p - str;
}
#define strlen(s) strlen3(s)
#endif
int get_next_line(int fd, char **pp) {
    char buf[32768];
    char *line = NULL, *new_line;
    char *p;
    ssize_t line_size = 0;
    ssize_t nread, chunk;
    while ((nread = read(fd, buf, sizeof buf)) > 0) {
        p = memchr(buf, '\n', nread);
        chunk = (p == NULL) ? nread : p - buf;
        new_line = realloc(line, line_size + chunk + 1);
        if (!new_line) {
            free(line);
            *pp = NULL;
            return 0;
        }
        line = new_line;
        memcpy(line + line_size, buf, chunk);
        line_size += chunk;
        line[line_size] = '\0';
        if (p != NULL) {
            lseek(fd, chunk + 1 - nread, SEEK_CUR);
            break;
        }
    }
    *pp = line;
    return line != NULL;
}
int main() {
    char *line = NULL;
    int fd, fd2, count, trial;
    clock_t min_clock = 0;
    fd = open("one_big_fat_line.txt", O_RDONLY);
    if (fd < 0) {
        printf("cannot open one_big_fat_line.txt\n");
        return 1;
    }
    fd2 = open("output.txt", O_WRONLY | O_CREAT | O_TRUNC, S_IREAD | S_IWRITE);
    if (fd2 < 0) {
        printf("cannot open output.txt\n");
        return 1;
    }
    for (trial = 0; trial < TRIALS; trial++) {
        clock_t t = clock();
        for (count = 0; count < ITERATIONS; count++) {
            lseek(fd, 0L, SEEK_SET);
            lseek(fd2, 0L, SEEK_SET);
            while (get_next_line(fd, &line) == 1) {
                write(fd2, line, strlen(line));
                write(fd2, "\n", 1);
                free(line);
            }
        }
        t = clock() - t;
        if (min_clock == 0 || min_clock > t)
            min_clock = t;
    }
    close(fd);
    close(fd2);
    double time_taken = (double)min_clock / CLOCKS_PER_SEC;
    printf("Version %d time: %.3f microseconds\n", VERSION, time_taken * 1000000 / ITERATIONS);
    return 0;
}
The program opens a file and reads lines from it with a custom function get_next_line() that uses unix system calls and heap allocation (realloc) to return arbitrary sized lines. It then writes these lines using the unix system call write and appends a newline with a separate system call.
Benchmarking this sequence with your test file, a 30000-byte file with a single line of ASCII characters, shows very different performance from what you measured: depending on the selected implementation of strlen and the compiler optimisation settings, the times on my laptop range from 15 microseconds to 82 microseconds per iteration, nowhere close to the 1 or 2 seconds you observe.
Using the C library default implementation, I get 14.5 microseconds per iteration with or without optimisations.
Using your naive strlen1 implementation, I get 82 microseconds with optimisations disabled and 25 microseconds with -O3 optimisations.
Using your unrolled strlen2 implementation, the time improves to 30 microseconds with -O0 and 20 microseconds with -O3.
Finally, strlen3, a more advanced C implementation that reads 8 bytes at a time, improves performance further: 21 microseconds with -O0 and 15.5 microseconds with -O3.
Note how compiler optimisations affect performance much more than manual optimisations.
The reason your unrolled version performs better is that the code generated for the naive loop increments the pointer once per byte and performs an unconditional jump once per byte, whereas the unrolled version reduces these to once every 9 bytes. Note however that the C compiler gets almost the same performance with -O3 on the naive code as what you get by unrolling the loop yourself.
The advanced version is very close in performance to the C library implementation, which may use assembly language with SIMD instructions. It reads 8 bytes at a time and performs an arithmetic trick to detect whether any of these bytes has its topmost bit changed from 0 to 1 when subtracting 1 from its value. The extra initial steps are required to align the pointer before reading 64-bit words, thus avoiding unaligned reads, which have undefined behavior on some architectures. It also assumes that memory protection is not available at the byte level, because the 64-bit reads can extend up to 7 bytes past the terminating null byte. On modern x86 systems, memory protection has a granularity of 4K or larger, but on some other systems, such as Windows 2.x, protection was much finer grained, preventing this optimisation altogether.
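The zero-byte test used in strlen3 can be tried in isolation. The following is a minimal sketch, not part of the benchmark program: has_zero_byte and load8 are hypothetical helper names introduced here for illustration, and load8 uses memcpy so that the demonstration does not depend on the alignment of its argument.

```c
#include <stdint.h>
#include <string.h>

/* Returns nonzero if any of the 8 bytes in x is zero.
 * Same expression as in strlen3: subtracting 1 from every byte sets the
 * top bit of a byte that was 0x00 (it wraps to 0xFF), and `& ~x` discards
 * bytes whose top bit was already set before the subtraction. */
static int has_zero_byte(uint64_t x) {
    const uint64_t sub  = 0x0101010101010101ULL;
    const uint64_t mask = 0x8080808080808080ULL;
    return (((x - sub) & ~x) & mask) != 0;
}

/* Hypothetical helper: copies 8 chars into a 64-bit word.
 * memcpy places no alignment requirement on p. */
static uint64_t load8(const char *p) {
    uint64_t x;
    memcpy(&x, p, sizeof x);
    return x;
}
```

For example, has_zero_byte(load8("abcdefgh")) is 0, while has_zero_byte(load8("abc\0efgh")) is nonzero. A word such as 0x8080808080808080 is not reported, because the `& ~x` term removes bytes whose top bit was set before the subtraction.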
Note however that the benchmark also measures the time to read from the input file, locate the newline and write to the output file. The difference in performance between strlen and strlen3 alone is probably much more significant. Indeed, a separate benchmark for just strlen(line) with your 30000-byte line shows a time of 2.2 microseconds for strlen3() and 0.85 microseconds for strlen().
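A standalone measurement of this kind could look like the sketch below. It is not the code that produced the figures above: time_strlen is a hypothetical helper, the string is filled with an arbitrary character, and the call goes through a volatile function pointer on the assumption that this keeps the compiler from hoisting or removing the call.

```c
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical harness: returns the best time, in microseconds per call,
 * of fn over `trials` runs of `iterations` calls on a string of `len`
 * bytes. Returns -1.0 on invalid arguments, allocation failure, or if fn
 * returned a wrong length. */
static double time_strlen(size_t (*fn)(const char *), size_t len,
                          int trials, int iterations) {
    if (trials <= 0 || iterations <= 0)
        return -1.0;
    char *line = malloc(len + 1);
    if (line == NULL)
        return -1.0;
    memset(line, 'x', len);
    line[len] = '\0';

    /* volatile pointer: the compiler must reload it and cannot assume
     * which function is called, so each call is actually performed */
    size_t (*volatile call)(const char *) = fn;
    size_t total = 0;
    clock_t best = 0;
    for (int trial = 0; trial < trials; trial++) {
        clock_t t = clock();
        for (int i = 0; i < iterations; i++)
            total += call(line);
        t = clock() - t;
        if (trial == 0 || t < best)
            best = t;   /* keep the fastest trial, as in the program above */
    }
    free(line);
    if (total != len * (size_t)trials * (size_t)iterations)
        return -1.0;
    return (double)best / CLOCKS_PER_SEC * 1e6 / iterations;
}
```

Calling time_strlen(strlen, 30000, 100, 100) would time the library function, and passing strlen3 instead would time the word-at-a-time version. Absolute numbers depend on the machine, the compiler and the flags used.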
Conclusions:
-O3 is a good default.