Why is `std::copy` 5x (!) slower than `memcpy` for reading one int from a char buffer, in my test program?

Question

This is a follow-up to this question where I posted this program:

#include <algorithm>
#include <cstdlib>
#include <cstdio>
#include <cstring>
#include <ctime>
#include <iomanip>
#include <iostream>
#include <vector>
#include <chrono>

class Stopwatch
{
public:
    typedef std::chrono::high_resolution_clock Clock;

    //! Constructor starts the stopwatch
    Stopwatch() : mStart(Clock::now())
    {
    }

    //! Returns elapsed number of seconds in decimal form.
    double elapsed()
    {
        return 1.0 * (Clock::now() - mStart).count() / Clock::period::den;
    }

    Clock::time_point mStart;
};

struct test_cast
{
    int operator()(const char * data) const
    {
        return *((int*)data);
    }
};

struct test_memcpy
{
    int operator()(const char * data) const
    {
        int result;
        memcpy(&result, data, sizeof(result));
        return result;
    }
};

struct test_memmove
{
    int operator()(const char * data) const
    {
        int result;
        memmove(&result, data, sizeof(result));
        return result;
    }
};

struct test_std_copy
{
    int operator()(const char * data) const
    {
        int result;
        std::copy(data, data + sizeof(int), reinterpret_cast<char *>(&result));
        return result;
    }
};

enum
{
    iterations = 2000,
    container_size = 2000
};

//! Returns a list of integers in binary form.
std::vector<char> get_binary_data()
{
    std::vector<char> bytes(sizeof(int) * container_size);
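    for (std::vector<char>::size_type i = 0; i != bytes.size(); i += sizeof(int))
    {
        // Copy exactly sizeof(int) bytes. Copying sizeof(i) bytes of a size_type
        // would overrun the buffer where size_type is wider than int.
        const int value = static_cast<int>(i);
        memcpy(&bytes[i], &value, sizeof(value));
    }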
    return bytes;
}

template<typename Function>
unsigned benchmark(const Function & function, unsigned & counter)
{
    std::vector<char> binary_data = get_binary_data();
    Stopwatch sw;
    for (unsigned iter = 0; iter != iterations; ++iter)
    {
        for (unsigned i = 0; i != binary_data.size(); i += 4)
        {
            const char * c = reinterpret_cast<const char*>(&binary_data[i]);
            counter += function(c);
        }
    }
    return unsigned(0.5 + 1000.0 * sw.elapsed());
}

int main()
{
    srand(time(0));
    unsigned counter = 0;

    std::cout << "cast:      " << benchmark(test_cast(),     counter) << " ms" << std::endl;
    std::cout << "memcpy:    " << benchmark(test_memcpy(),   counter) << " ms" << std::endl;
    std::cout << "memmove:   " << benchmark(test_memmove(),  counter) << " ms" << std::endl;
    std::cout << "std::copy: " << benchmark(test_std_copy(), counter) << " ms" << std::endl;
    std::cout << "(counter:  " << counter << ")" << std::endl << std::endl;

}

I noticed that for some reason std::copy performs much worse than memcpy. The output looks like this on my Mac using gcc 4.7.

g++ -o test -std=c++0x -O0 -Wall -Werror -Wextra -pedantic-errors main.cpp
cast:      41 ms
memcpy:    46 ms
memmove:   53 ms
std::copy: 211 ms
(counter:  3838457856)

g++ -o test -std=c++0x -O1 -Wall -Werror -Wextra -pedantic-errors main.cpp
cast:      8 ms
memcpy:    7 ms
memmove:   8 ms
std::copy: 19 ms
(counter:  3838457856)

g++ -o test -std=c++0x -O2 -Wall -Werror -Wextra -pedantic-errors main.cpp
cast:      3 ms
memcpy:    2 ms
memmove:   3 ms
std::copy: 27 ms
(counter:  3838457856)

g++ -o test -std=c++0x -O3 -Wall -Werror -Wextra -pedantic-errors main.cpp
cast:      2 ms
memcpy:    2 ms
memmove:   3 ms
std::copy: 16 ms
(counter:  3838457856)

As you can see, even with -O3 it is up to 5 times (!) slower than memcpy.

The results are similar on Linux.

Does anyone know why?

Answer 1

I agree with @rici's comment about developing a more meaningful benchmark, so I rewrote your test to benchmark copying between two vectors using memcpy(), memmove(), std::copy() and the std::vector assignment operator:

#include <algorithm>
#include <iostream>
#include <vector>
#include <chrono>
#include <random>
#include <cstring>
#include <cassert>
#include <functional>

typedef std::vector<int> vector_type;

void test_memcpy(vector_type & destv, vector_type const & srcv)
{
    vector_type::pointer       const dest = destv.data();
    vector_type::const_pointer const src  = srcv.data();

    std::memcpy(dest, src, srcv.size() * sizeof(vector_type::value_type));
}

void test_memmove(vector_type & destv, vector_type const & srcv)
{
    vector_type::pointer       const dest = destv.data();
    vector_type::const_pointer const src  = srcv.data();

    std::memmove(dest, src, srcv.size() * sizeof(vector_type::value_type));
}

void test_std_copy(vector_type & dest, vector_type const & src)
{
    std::copy(src.begin(), src.end(), dest.begin());
}

void test_assignment(vector_type & dest, vector_type const & src)
{
    dest = src;
}

auto
benchmark(std::function<void(vector_type &, vector_type const &)> copy_func)
    ->decltype(std::chrono::milliseconds().count())
{
    std::random_device rd;
    std::mt19937 generator(rd());
    std::uniform_int_distribution<vector_type::value_type> distribution;

    static vector_type::size_type const num_elems = 2000;

    vector_type dest(num_elems);
    vector_type src(num_elems);

    // Fill the source and destination vectors with random data.
    for (vector_type::size_type i = 0; i < num_elems; ++i) {
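        // The vectors already hold num_elems elements, so assign in place
        // instead of using push_back (which would double their size).
        src[i] = distribution(generator);
        dest[i] = distribution(generator);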
    }

    static int const iterations = 50000;

    std::chrono::time_point<std::chrono::system_clock> start, end;

    start = std::chrono::system_clock::now();

    for (int i = 0; i != iterations; ++i)
        copy_func(dest, src);

    end = std::chrono::system_clock::now();

    assert(src == dest);

    return
        std::chrono::duration_cast<std::chrono::milliseconds>(
            end - start).count();
}

int main()
{
    std::cout
        << "memcpy:     " << benchmark(test_memcpy)     << " ms" << std::endl
        << "memmove:    " << benchmark(test_memmove)    << " ms" << std::endl
        << "std::copy:  " << benchmark(test_std_copy)   << " ms" << std::endl
        << "assignment: " << benchmark(test_assignment) << " ms" << std::endl
        << std::endl;
}

I went a little overboard with C++11 just for fun.

Here are the results I get on my 64-bit Ubuntu box with g++ 4.6.3:

$ g++ -O3 -std=c++0x foo.cpp ; ./a.out 
memcpy:     33 ms
memmove:    33 ms
std::copy:  33 ms
assignment: 34 ms

The results are all quite comparable. I also get comparable times in all test cases when I change the integer type in the vector, e.g. to long long.

Unless my benchmark rewrite is broken, it looks like your own benchmark isn't performing a valid comparison. HTH!

Answer 2

Looks to me like the answer is that gcc can optimize these particular calls to memmove and memcpy, but not std::copy. gcc is aware of the semantics of memmove and memcpy, and in this case can take advantage of the fact that the size is known (sizeof(int)) to turn the call into a single mov instruction.

std::copy is implemented in terms of memcpy, but apparently the gcc optimizer doesn't manage to figure out that data + sizeof(int) - data is exactly sizeof(int). So the benchmark ends up making an actual call to memcpy.

I got all that by invoking gcc with -S and flipping quickly through the output; I could easily have gotten it wrong, but what I saw seems consistent with your measurements.
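To make the two call shapes concrete, here is a minimal sketch of the two readers being compared. The function names load_with_memcpy and load_with_std_copy are made up for illustration and are not from the original program. The first passes a compile-time constant size to memcpy; the second hands std::copy an iterator range, so the byte count only appears as a pointer difference. Whether a given compiler folds either one into a single load depends on its version and flags, so the generated assembly should be checked rather than assumed.

```cpp
#include <algorithm>
#include <cstring>

// Hypothetical helper: the size argument is the constant sizeof(int),
// which the compiler can see directly at the call site.
inline int load_with_memcpy(const char* data)
{
    int result;
    std::memcpy(&result, data, sizeof(result));
    return result;
}

// Hypothetical helper: the number of bytes is implied by the range
// [data, data + sizeof(int)), i.e. it is computed as last - first
// somewhere inside the std::copy implementation.
inline int load_with_std_copy(const char* data)
{
    int result;
    std::copy(data, data + sizeof(int), reinterpret_cast<char*>(&result));
    return result;
}
```

Both functions return the same value for the same input; only the code the optimizer produces for them may differ. Compiling with -S and comparing the two bodies is one way to see that difference.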

By the way, I think the test is more or less meaningless. A more plausible real-world test might be creating an actual vector<int> src and an int[N] dst, and then comparing memcpy(dst, src.data(), sizeof(int)*src.size()) with std::copy(src.begin(), src.end(), dst).
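A rough sketch of the two copies suggested here might look as follows. The constant N = 2000 and the function names are assumptions chosen for illustration, and timing code is left out; both functions also assume src.size() does not exceed N.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <vector>

// Assumed element count for the sketch; not taken from the original post.
const std::size_t N = 2000;

// Bulk copy through memcpy: the byte count is computed by hand.
// Precondition (assumed): src.size() <= N.
void copy_with_memcpy(int (&dst)[N], const std::vector<int>& src)
{
    if (!src.empty())
        std::memcpy(dst, src.data(), sizeof(int) * src.size());
}

// Bulk copy through std::copy: the array decays to int*, which is the
// output iterator. Precondition (assumed): src.size() <= N.
void copy_with_std_copy(int (&dst)[N], const std::vector<int>& src)
{
    std::copy(src.begin(), src.end(), dst);
}
```

Wrapping each call in the same timing loop would give a comparison in which both variants copy a realistically sized block rather than a single int.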

Answer 3

memcpy and std::copy each have their uses. std::copy should (as pointed out by Cheers below) be as slow as memmove, because there is no guarantee that the memory regions will not overlap. It also supports iterators, which means you can copy non-contiguous regions very easily (think of sparsely allocated structures like linked lists, or even custom classes/structures that implement iterators). memcpy only works on contiguous regions and as such can be heavily optimized.
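As an illustration of the iterator point, the sketch below copies out of a std::list, whose nodes are not adjacent in memory, and out of a std::vector, whose storage is contiguous. The function names are hypothetical.

```cpp
#include <algorithm>
#include <cstring>
#include <list>
#include <vector>

// Node-based source: there is no single block of memory to hand to memcpy,
// so the iterator-based std::copy is the tool that applies.
std::vector<int> copy_from_list(const std::list<int>& src)
{
    std::vector<int> dst(src.size());
    std::copy(src.begin(), src.end(), dst.begin());
    return dst;
}

// Contiguous source: the elements form one block, so memcpy can be used.
std::vector<int> copy_from_vector(const std::vector<int>& src)
{
    std::vector<int> dst(src.size());
    if (!src.empty())
        std::memcpy(dst.data(), src.data(), src.size() * sizeof(int));
    return dst;
}
```

For the vector case std::copy would work just as well; the memcpy version is shown only to contrast the two requirements.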

Answer 4

Those are not the results I get:

> g++ -O3 XX.cpp 
> ./a.out
cast:      5 ms
memcpy:    4 ms
std::copy: 3 ms
(counter:  1264720400)

Hardware: 2GHz Intel Core i7
Memory:   8G 1333 MHz DDR3
OS:       Mac OS X 10.7.5
Compiler: i686-apple-darwin11-llvm-g++-4.2 (GCC) 4.2.1

On a Linux box I get different results:

> g++ -std=c++0x -O3 XX.cpp 
> ./a.out 
cast:      3 ms
memcpy:    4 ms
std::copy: 21 ms
(counter:  731359744)


Hardware:  Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
Memory:    61363780 kB
OS:        Linux ip-10-58-154-83 3.2.0-29-virtual #46-Ubuntu SMP
Compiler:  g++ (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3

Answer 5

According to the assembler output of G++ 4.8.1, test_memcpy:

movl    (%r15), %r15d

test_std_copy:

movl    $4, %edx
movq    %r15, %rsi
leaq    16(%rsp), %rdi
call    memcpy

As you can see, std::copy successfully recognized that it can copy the data with memcpy, but for some reason further inlining did not happen, and that is the reason for the performance difference.

By the way, Clang 3.4 produces identical code for both cases:

movl    (%r14,%rbx), %ebp

Answer 6

EDIT: I leave this answer for reference; the odd timings with gcc seem to be an artifact of "code alignment" (see comments).

I was about to say that this was an implementation glitch in gcc 4 at the time, but it might be more complicated than that. My results are (used 20000/20000 for the counters):

$ g++ -Ofast a.cpp; ./a.out
cast:      24 ms
memcpy:    47 ms
memmove:   24 ms
std::copy: 24 ms
(counter:  1787289600)

$ g++ -O3 a.cpp; ./a.out
cast:      24 ms
memcpy:    24 ms
memmove:   24 ms
std::copy: 47 ms
(counter:  1787289600)

$ g++ --version
g++ (Ubuntu 9.2.1-9ubuntu2) 9.2.1 20191008

Notice how the std::copy and memcpy results swap between compiling with -O3 and with -Ofast. Also, memmove is not slower than either.

In clang the results are simpler:

$ clang++ -O3 a.cpp; ./a.out
cast:      26 ms
memcpy:    26 ms
memmove:   26 ms
std::copy: 26 ms
(counter:  1787289600)

$ clang++ -Ofast a.cpp; ./a.out
cast:      26 ms
memcpy:    26 ms
memmove:   26 ms
std::copy: 26 ms
(counter:  1787289600)

$ clang++ --version
clang version 9.0.0-2 (tags/RELEASE_900/final)

perf results: https://pastebin.com/BZCZiAWQ

6 answers

Answer 1: score 10, 2012-10-29 23:41:43
Answer 2: score 8, 2012-10-29 21:16:18
Answer 3: score 3, 2012-10-29 19:41:46
Answer 4: score 3, accepted, 2012-10-29 20:16:46
Answer 5: score 1, 2013-11-10 03:10:19
Answer 6: score 0, 2020-02-20 03:36:59
