
C++ pass value by reference vs by copy of POD

I heard that passing a variable by reference is not always faster than passing it by value. Passing by reference is faster for big variables, but for small ones it can go either way.

Passing by value costs time to create the copy, but reading the local copy afterwards should be fast. Passing by reference wastes no time creating a copy, but every access has to go through the pointer first and then to the actual data.

I am aware that this detail hardly matters in a real optimization problem, but it was interesting for me to measure. (I know that -O0 is useless for optimization work, but this code is too simple; with optimization enabled I was not sure what I was actually measuring.)

g++ -std=c++14 -O0 -g3 -DSIZE_OF_DATA_ARRAY=16 main.cpp && ./a.out

g++ (Ubuntu 6.3.0-12ubuntu2) 6.3.0 20170406

SIZE_OF_DATA_ARRAY | copy time [s] | reference time [s]
------------------ | ------------- | -------------------
                 4 | 0.04          | 0.045
                 8 | 0.04          | 0.46
                16 | 0.04          | 0.05
                17 | 0.07          | 0.05
                24 | 0.07          | 0.05

My questions:

  1. Why is the execution time for copying nearly constant across struct sizes?

  2. Why is there a threshold between 16 and 17 elements for copying?

My guess: it is connected with the cache.

My code:

#include <iostream>
#include <vector>
#include <limits>

#include <algorithm>
#include <chrono>
#include <ctime>    // std::time_t, std::localtime
#include <iomanip>

struct Data {
    double x[SIZE_OF_DATA_ARRAY];
};

double workOnData(Data &data) {
    for (auto i = 0; i < 10; ++i) {
        data.x[0] -= 0.5 * (data.x[0] - 1);
    }
    return data.x[0];
}

void runTestSuite() {
    auto queries = 1000000;
    Data data;
    for (auto i = 0; i < queries; ++i) {
        data.x[0] = i;
        auto val = workOnData(data);
        if (val == -357)
            data.x[0] = 1;
    }
}

int main() {
    std::cout << "sizeof(Data) = " << sizeof(Data) << "\n";

    size_t numberOfTests = 99;
    std::vector<std::chrono::duration<double>> timeMeasurements(numberOfTests);
    std::chrono::time_point<std::chrono::system_clock> startTime, endTime;
    for (size_t i = 0; i < numberOfTests; ++i) {
        startTime = std::chrono::system_clock::now();

        runTestSuite();

        endTime = std::chrono::system_clock::now();
        timeMeasurements[i] = endTime - startTime;
    }
    std::sort(timeMeasurements.begin(), timeMeasurements.end());

    std::chrono::system_clock::time_point now =  std::chrono::system_clock::now();
    std::time_t now_c = std::chrono::system_clock::to_time_t(now);

    std::cout << std::put_time(std::localtime(&now_c), "%F %T") 
    << ": median time = " << timeMeasurements[numberOfTests * 0.5].count() << "s\n";

    return 0;
}
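The by-value variant that produces the "copy time" column is not shown above; a minimal sketch of it, assuming the only change is dropping the & from the parameter (the name workOnDataByValue is hypothetical):

double workOnDataByValue(Data data) {   // the whole struct is copied here
    for (auto i = 0; i < 10; ++i) {
        data.x[0] -= 0.5 * (data.x[0] - 1);
    }
    return data.x[0];
}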

Why is the execution time for copying nearly constant across struct sizes?

The best way to understand this is to view the assembly and see which instructions the compiler emitted. What you find there depends on the compiler's optimization settings and on whether you built a debug or release configuration.
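One way to see that assembly, reusing the build line from the question (-S stops g++ after compilation and writes the assembly to main.s):

g++ -std=c++14 -O0 -S -DSIZE_OF_DATA_ARRAY=16 main.cpp && less main.s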

It also depends on the processor. For example, some processors have specialized instructions for copying large blocks of memory. Others may copy data in parallel chunks, depending on the size of the structure. Some platforms even have hardware assistance, such as a DMA controller.

Note that sometimes unrolling may be faster than using special instructions or hardware assistance (it depends on the data size).

Why is there a threshold between 16 and 17 elements for copying?

The threshold may come from the difference between aligned and unaligned accesses.

Let's take a 32-bit processor. It likes to fetch 4 bytes at a time. Accessing 24 bytes takes 6 fetches and accessing 16 bytes takes 4 fetches. However, accessing 17, 18, or 19 bytes requires 5 fetches: the processor must fetch one more full 4-byte word to pick up the remainder bytes.
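The count is just a ceiling division by the bus width; a throwaway sketch, assuming an aligned start and the 4-byte bus from the example above:

// Number of 4-byte fetches needed to touch n contiguous bytes
// (illustration only; assumes the data starts on a 4-byte boundary).
unsigned fetchesNeeded(unsigned n) {
    return (n + 3) / 4;  // ceiling division
}
// fetchesNeeded(16) == 4, fetchesNeeded(17) == 5, fetchesNeeded(24) == 6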

Another scenario is the implementation of the copy function. Some copy routines use 32-bit copies for the first run of 4-byte quantities, then switch to byte copies for the remainder; others switch to byte copying for all of the bytes, depending on the size of the data. There are many possibilities.
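A minimal sketch of that strategy (not any libc's actual memcpy, just an illustration of word-then-byte copying):

#include <cstdint>
#include <cstring>

void copyWordsThenBytes(void *dst, const void *src, std::size_t n) {
    auto d = static_cast<std::uint8_t *>(dst);
    auto s = static_cast<const std::uint8_t *>(src);
    // Copy whole 32-bit words while enough bytes remain.
    while (n >= 4) {
        std::uint32_t word;
        std::memcpy(&word, s, 4);  // memcpy sidesteps unaligned-access UB
        std::memcpy(d, &word, 4);
        s += 4;
        d += 4;
        n -= 4;
    }
    // Copy the remaining 0-3 bytes one at a time.
    while (n--)
        *d++ = *s++;
}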

The truth for your system lies in inspecting the assembly for the code that copies the data.

Cache Hits & Misses
Your performance metrics may be skewed by processor cache effects. If the processor already has your data in cache, the loop will run much faster; there is usually a performance hit on the first access of the data. More time is wasted reloading the data cache if your data is too big for it or lies outside the region it currently covers.
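If you want to separate that first-touch cost from the steady state, one common trick is an untimed warm-up pass before the measurement loop; a sketch against the code in the question (the assumption being that one extra call is enough to populate the caches):

runTestSuite();  // untimed warm-up: pulls data and code into the caches
for (size_t i = 0; i < numberOfTests; ++i) {
    startTime = std::chrono::system_clock::now();
    runTestSuite();
    endTime = std::chrono::system_clock::now();
    timeMeasurements[i] = endTime - startTime;
}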

Instruction Cache Issues
Many processors have deep pipelines and dedicated instruction caches. When they encounter a branch (such as the end of a for loop), the processor may have to flush the pipeline and reload instructions from another location in the program, which takes time. You can demonstrate this by unrolling the loop in chunks of different sizes and measuring the performance.
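For example, the 10-iteration loop in workOnData could be unrolled by hand; a 2x sketch (the name workOnDataUnrolled is hypothetical; vary the chunk size and compare timings):

double workOnDataUnrolled(Data &data) {
    for (auto i = 0; i < 10; i += 2) {  // two iterations per branch
        data.x[0] -= 0.5 * (data.x[0] - 1);
        data.x[0] -= 0.5 * (data.x[0] - 1);
    }
    return data.x[0];
}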
