
Slow writing to array in C++

I was just wondering if this is expected behavior in C++. The code below runs in around 0.001 ms:

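// num_elements is defined elsewhere in the program
for (int l = 0; l < 100000; l++) {
    int total = 0;
    for (int i = 0; i < num_elements; i++) {
        total += i;
    }
    // total is never read after the inner loop
}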

However, if the results are written to an array, the execution time shoots up to 15 ms:

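// num_elements is defined elsewhere in the program
int *values = (int*)malloc(sizeof(int) * 100000);
for (int l = 0; l < 100000; l++) {
    int total = 0;
    for (unsigned int i = 0; i < num_elements; i++) {
        total += i;
    }
    values[l] = total;  // the result is now stored, so it must be computed
}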

I can appreciate that writing to the array takes time, but is the time proportionate?

Cheers everyone

The first example can be implemented using just CPU registers. Those can be accessed billions of times per second. The second example uses so much memory that it certainly overflows L1 and possibly L2 cache (depending on CPU model). That will be slower. Still, 15 ms / 100,000 writes comes out to 150 ns per write (roughly 6.7 million writes per second), and each write comes after a full pass of the inner loop. That's not slow.

It looks like the compiler is optimizing that loop out entirely in the first case.

The total effect of the loop is a no-op, so the compiler just removes it.

It's very simple. In the first case you have just 3 variables, which fit easily in GPRs (general-purpose registers). That doesn't mean they stay in registers all the time, but even then they are probably in L1 cache, which means that they can be accessed very quickly.

In the second case you have more than 100k variables, and you need about 400 kB to store them. That is definitely too much for registers and L1 cache. In the best case the array could sit in L2 cache, but probably not all of it will. If something is not in a register, L1, or L2 (I assume that your processor doesn't have L3), you need to fetch it from RAM, and that takes much more time.

I would suspect that what you are seeing is an effect of virtual memory and possibly paging. The malloc call is going to allocate a decent-sized chunk of memory that is probably represented by a number of virtual pages. Each page is linked into process memory separately.

You may also be measuring the cost of calling malloc, depending on how you timed the loop. In either case, the performance is going to be very sensitive to compiler optimization options, threading options, compiler versions, runtime versions, and just about anything else. You cannot safely assume that the cost is linear with the size of the allocation. The only thing that you can do is measure it and figure out how best to optimize once it has been proven to be a problem.
