
How can I speed up crc32 calculation?

I'm trying to write a crc32 implementation on linux that's as fast as possible, as an exercise in learning to optimise C. I've tried my best, but I haven't been able to find many good resources online. I'm not even sure if my buffer size is sensible; it was chosen by repeated experimentation.

#include <stdio.h>
#define BUFFSIZE 1048567

const unsigned long int lookupbase = 0xEDB88320;
unsigned long int crctable[256] = {
0x00000000, 0x77073096, 0xEE0E612C, 0x990951BA,
/* LONG LIST OF PRECALCULATED VALUES */
0xB40BBE37, 0xC30C8EA1, 0x5A05DF1B, 0x2D02EF8D};

int main(int argc, char *argv[]){
    register unsigned long int x;
    int i;
    register unsigned char *c, *endbuff;
    unsigned char buff[BUFFSIZE];
    register FILE *thisfile=NULL;
    for (i = 1; i < argc; i++){
        thisfile = fopen(argv[i], "r");
        if (thisfile == NULL) {
            printf("Unable to open ");
        } else {
            x = 0xFFFFFFFF;
            c = &(buff[0]);
            endbuff = &(buff[fread(buff, (sizeof (unsigned char)), BUFFSIZE, thisfile)]);
            while (c != endbuff){
                while (c != endbuff){
                    x=(x>>8) ^ crctable[(x&0xFF)^*c];
                    c++;
                }
                c = &(buff[0]);
                endbuff = &(buff[fread(buff, (sizeof (unsigned char)), BUFFSIZE, thisfile)]);
            }
            fclose(thisfile);
            x = x ^ 0xFFFFFFFF;
            printf("%0.8X ", x);
        }
        printf("%s\n", argv[i]);
    }
    return 0;
}

Thanks in advance for any suggestions or resources I can read through.

On Linux? Forget about the register keyword, that's just a suggestion to the compiler and, from my experience with gcc, it's a waste of space. gcc is more than capable of figuring that out for itself.

I would just make sure you're compiling with the insane optimisation level, -O3, and check that. I've seen gcc produce code at that level which took me hours to understand, it was that sneaky.

And, on the buffer size, make it as large as you possibly can. Even with buffering, the cost of calling fread is still a cost, so the less you do it, the better. You would see a huge improvement if you increased the buffer size from 1K to 1M; not so much going from 1M to 2M, but even a small amount of increased performance is an increase. And, 2M isn't the upper bound of what you can use, I'd set it to one or more gigabytes if possible.

You may then want to put it at file level (rather than inside main). At some point, the stack won't be able to hold it.
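For instance (a minimal sketch; the 64 MB figure is arbitrary, and BUFFSIZE simply reuses your macro name), give the buffer static storage duration at file scope so it lives outside the stack:

#define BUFFSIZE (64u * 1024 * 1024)   /* 64 MB: arbitrary, tune to your machine */

static unsigned char buff[BUFFSIZE];   /* file scope: not limited by stack size */

Heap allocation with malloc(3) works just as well if you'd rather decide the size at run time.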

As with most optimisations, you can usually trade space for time. Keep in mind that, for small files (less than 1M), you won't see any improvement since there is still only one read no matter how big you make the buffer. You may even find a slight slowdown if the loading of the process has to take more time to set up memory.

But, since this would only be for small files (where the performance isn't a problem anyway), it shouldn't really matter. Large files, where the performance is an issue, should hopefully find an improvement.

And I know I don't need to tell you this (since you indicate you are doing it), but I will mention it anyway for those who don't know: Measure, don't guess! The ground is littered with the corpses of those who optimised with guesswork :-)

You are not going to be able to speed up the actual arithmetic of the CRC calculation, so the areas you can look at are the overhead of (a) reading the file, and (b) looping.

You're using a pretty large buffer size, which is good (but why is it an odd number?). Using a read(2) system call (assuming you're on a unix-like system) instead of the fread(3) standard library function may save you one copy operation (copying the data from fread's internal buffer into your buffer).
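As an illustration only, here is a sketch of the read(2) version (the function name do_crc32 and the 64 KB chunk size are inventions for the example, error handling is minimal, and it reuses the crctable from the question):

#include <fcntl.h>
#include <unistd.h>

/* Sketch: CRC one file with open(2)/read(2) instead of fopen/fread. */
unsigned long do_crc32(const char *path)
{
    unsigned char buf[65536];               /* one read(2) per 64 KB chunk */
    unsigned long x = 0xFFFFFFFF;
    ssize_t n, i;
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return 0;                           /* sketch: caller treats 0 as failure */
    while ((n = read(fd, buf, sizeof buf)) > 0)
        for (i = 0; i < n; i++)
            x = (x >> 8) ^ crctable[(x & 0xFF) ^ buf[i]];
    close(fd);
    return x ^ 0xFFFFFFFF;
}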

For the loop overhead, look into loop unrolling.
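A hedged sketch of what that might look like here, four bytes per pass (len stands for the count fread() returned; gcc at -O3 will often do this transformation on its own, so time it before and after):

/* Sketch: unroll the inner CRC loop by four; handle the 0-3 leftover bytes after. */
size_t k = 0;
while (k + 4 <= len) {
    x = (x >> 8) ^ crctable[(x & 0xFF) ^ buff[k]];
    x = (x >> 8) ^ crctable[(x & 0xFF) ^ buff[k + 1]];
    x = (x >> 8) ^ crctable[(x & 0xFF) ^ buff[k + 2]];
    x = (x >> 8) ^ crctable[(x & 0xFF) ^ buff[k + 3]];
    k += 4;
}
while (k < len)                             /* tail */
    x = (x >> 8) ^ crctable[(x & 0xFF) ^ buff[k++]];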


Your code also has some redundancies that you might want to eliminate.

  • sizeof (unsigned char) is 1 (by definition in C); no need to explicitly compute it

  • c = &(buff[0]) is exactly equivalent to c = buff c = &(buff[0])完全等同于c = buff

Neither of these changes will improve the performance of the code (assuming a decent compiler), but they will make it clearer and more in accordance with usual C style.

You've asked for three values to be stored in registers, but standard x86 only has four general purpose registers: that's an awful lot of burden to place on the last remaining register, which is one reason why I expect register really only prevents you from ever using &foo to find the address of the variable. I don't think any modern compiler even uses it as a hint, these days. Feel free to remove all three uses and re-time your application.

Since you're reading in huge chunks of the file yourself, you might as well use open(2) and read(2) directly, and remove all the standard IO handling behind the scenes. Another common approach is to open(2) and mmap(2) the file into memory: let the OS page it in as required. This may allow future pages to be optimistically read from disk while you're doing your computation: this is a common access pattern, and one the OS designers have attempted to optimize. (The simple mechanism of mapping the entire file at once does put an upper limit on the size of the files you can handle, probably about 2.5 gigabytes on 32-bit platforms and absolutely huge on 64-bit platforms. Mapping the file in chunks will allow you to handle arbitrarily sized files even on 32-bit platforms, but at the cost of a loop like the one you've got now, only for mapping rather than reading.)
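For reference, a sketch of the whole-file mmap(2) variant (the name crc32_mmap is made up, error handling is minimal, it reuses the question's crctable, and the posix_madvise() readahead hint is optional):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Sketch: map the whole file read-only and CRC it in place. */
unsigned long crc32_mmap(const char *path)
{
    unsigned long x = 0xFFFFFFFF;
    struct stat st;
    unsigned char *p;
    off_t i;
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return 0;                           /* sketch: minimal error handling */
    if (fstat(fd, &st) < 0) {
        close(fd);
        return 0;
    }
    if (st.st_size == 0) {                  /* mmap() of length 0 would fail */
        close(fd);
        return x ^ 0xFFFFFFFF;              /* CRC of an empty file */
    }
    p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                              /* the mapping stays valid after close */
    if (p == MAP_FAILED)
        return 0;
    posix_madvise(p, st.st_size, POSIX_MADV_SEQUENTIAL);  /* hint: front-to-back read */
    for (i = 0; i < st.st_size; i++)
        x = (x >> 8) ^ crctable[(x & 0xFF) ^ p[i]];
    munmap(p, st.st_size);
    return x ^ 0xFFFFFFFF;
}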

As David Gelhar points out, you're using an odd-length buffer -- this might complicate the code path of reading the file into memory. If you want to stick with reading from files into buffers, I suggest using a multiple of 8192 (two pages of memory), as it won't have special cases until the last loop.

If you're really intent on eking out the last bit of speed and don't mind drastically increasing the size of your pre-computation table, you can look at the file in 16-bit chunks, rather than just 8-bit chunks. Frequently, accessing memory along 16-bit alignment is faster than along 8-bit alignment, and you'd cut the number of iterations through your loop in half, which usually gives a huge speed boost. The downside, of course, is increased memory pressure (65k entries, each of 8 bytes, rather than just 256 entries each of 4 bytes), and the much larger table is much less likely to fit entirely in the CPU cache.
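One concrete way to get two bytes per iteration without the 65,536-entry table is the closely related "slicing-by-2" layout, which derives a second 256-entry table from the one you already have. A hedged sketch (crctable2, init_crctable2 and crc32_slice2 are invented names; call init_crctable2() once at startup):

#include <stdint.h>
#include <stddef.h>

static uint32_t crctable2[256];    /* CRC of "byte i followed by a zero byte" */

/* Build the second-level table from the ordinary 256-entry one. */
void init_crctable2(void)
{
    int i;
    for (i = 0; i < 256; i++) {
        uint32_t t = (uint32_t)crctable[i];
        crctable2[i] = (t >> 8) ^ (uint32_t)crctable[t & 0xFF];
    }
}

/* Sketch: fold two bytes per iteration, with a byte-at-a-time tail. */
uint32_t crc32_slice2(const unsigned char *buf, size_t len, uint32_t x)
{
    size_t i = 0;
    while (i + 2 <= len) {
        x = (x >> 16)
            ^ crctable2[(x ^ buf[i]) & 0xFF]
            ^ (uint32_t)crctable[((x >> 8) ^ buf[i + 1]) & 0xFF];
        i += 2;
    }
    if (i < len)                   /* odd trailing byte */
        x = (x >> 8) ^ (uint32_t)crctable[(x & 0xFF) ^ buf[i]];
    return x;
}

Here x is the running value you already keep between fread() calls (initialised to 0xFFFFFFFF and flipped at the end, exactly as in your code); whether this beats the plain byte loop on your hardware is something to measure, not assume.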

And the last optimization idea that crosses my mind is to fork(2) into 2, 3, or 4 processes (or use threading), each of which can compute the crc32 of a portion of the file, and then combine the end results after all processes have completed. crc32 may not be computationally intensive enough to actually benefit from using multiple cores on SMP or multicore computers, and figuring out how to combine partial computations of crc32 may not be feasible -- I haven't looked into it myself :) -- but it might repay the effort, and learning how to write multi-process or multi-threaded software is well worth the effort regardless.
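For what it's worth, combining partial CRCs is possible: zlib ships crc32_combine() for exactly this, so you don't have to derive the GF(2) math yourself. A hedged sketch using POSIX threads, assuming the data is already in memory (say via mmap) and that you link with -lz -lpthread; crc32_parallel, crc_worker and NTHREADS are invented names:

#include <pthread.h>
#include <zlib.h>

#define NTHREADS 4                          /* arbitrary choice for the sketch */

struct chunk { const unsigned char *p; size_t len; unsigned long crc; };

static void *crc_worker(void *arg)
{
    struct chunk *c = arg;
    /* zlib's crc32() takes a uInt length, so one call handles < 4 GB per chunk */
    c->crc = crc32(0L, c->p, (uInt)c->len);
    return NULL;
}

/* Sketch: CRC 'len' bytes at 'data' with NTHREADS threads, then stitch the parts. */
unsigned long crc32_parallel(const unsigned char *data, size_t len)
{
    pthread_t tid[NTHREADS];
    struct chunk ch[NTHREADS];
    size_t per = len / NTHREADS, off = 0;
    unsigned long crc;
    int i;

    for (i = 0; i < NTHREADS; i++) {
        ch[i].p = data + off;
        ch[i].len = (i == NTHREADS - 1) ? len - off : per;   /* last thread takes the remainder */
        off += ch[i].len;
        pthread_create(&tid[i], NULL, crc_worker, &ch[i]);
    }
    pthread_join(tid[0], NULL);
    crc = ch[0].crc;
    for (i = 1; i < NTHREADS; i++) {
        pthread_join(tid[i], NULL);
        crc = crc32_combine(crc, ch[i].crc, ch[i].len);      /* merge partial CRCs */
    }
    return crc;
}

Whether this wins depends on how fast the disk can feed the cores, so measure it against the single-threaded version.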
