
Do strncpy/memcpy/memmove copy data byte by byte, or in another, more efficient way?

As we know, on a computer with multi-byte words such as x86/x86_64, it is more efficient to copy or move a large block of memory word by word (4 or 8 bytes per step) than byte by byte.

I'm curious which approach strncpy/memcpy/memmove take, and how they deal with memory word alignment.

char buf_A[8], buf_B[8];

// I often want to write this
*(double*)buf_A = *(double*)buf_B;

// instead of this
strcpy(buf_A, buf_B);
// but it hurts the readability of my code.

In general, you don't have to think much about how memcpy or similar functions are implemented. You should assume they are efficient unless profiling proves otherwise.

In practice it is indeed optimized nicely. See, e.g., the following test code:

#include <cstring>

void test(char (&a)[8], char (&b)[8])
{
    std::memcpy(&a,&b,sizeof a);
}

Compiling it with g++ 7.3.0 using the command g++ test.cpp -O3 -S -masm=intel, we can see the following assembly code:

test(char (&) [8], char (&) [8]):

    mov     rax, QWORD PTR [rsi]
    mov     QWORD PTR [rdi], rax
    ret

As you can see, the copy is not only inlined but also collapsed into a single 8-byte read and write.

In this case you may prefer memcpy, as it is the equivalent of *(double*)buf_A = *(double*)buf_B; without undefined behavior.
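To make the comparison with the cast concrete, here is a minimal sketch. The helper names load_double, copy8 and demo_copy are made up for illustration, and the code assumes an 8-byte double, which holds on x86/x86_64.

```cpp
#include <cassert>
#include <cstring>

static_assert(sizeof(double) == 8, "this sketch assumes an 8-byte double");

// Hypothetical helper: read a double out of an 8-byte char buffer.
// Copying into a real double object avoids dereferencing a char buffer
// through a double*, which is where the cast version goes wrong.
inline double load_double(const char (&buf)[8]) {
    double value;
    std::memcpy(&value, buf, sizeof value);
    return value;
}

// Hypothetical helper: copy 8 bytes from one char buffer to another.
// The size is a compile-time constant, so an optimizing compiler is free
// to turn this into a single 8-byte load and store.
inline void copy8(char (&dst)[8], const char (&src)[8]) {
    std::memcpy(dst, src, sizeof dst);
}

// Small round trip: store a double's bytes, copy them, read them back.
inline void demo_copy() {
    char src[8];
    char dst[8] = {};
    const double d = 3.5;
    std::memcpy(src, &d, sizeof d);
    copy8(dst, src);
    assert(load_double(dst) == 3.5);
}
```

Whether the compiler really emits a single 8-byte move depends on the compiler and optimization level, so check the generated assembly if it matters.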

You should not worry about calling memcpy, because by default the compiler assumes that a call to memcpy has the meaning defined in the C library. So, depending on the types of the arguments and/or what is known about the copy size at compile time, the compiler may choose not to call the C library function and instead inline a better-suited copy strategy. On gcc you can disable this behavior with the -fno-builtin compiler option: demo.

Having the compiler replace the memcpy call is desirable, because the library memcpy checks the size and alignment of the pointers in order to pick the most efficient copy strategy (it may copy in blocks ranging from single chars up to very large blocks using, for example, AVX512 instructions). These checks, and the call to memcpy itself, have a cost.

Also, if you are looking for efficiency, you should be concerned about memory alignment. So you may want to declare the alignment of your buffer:

alignas(8) char buf_A[8];
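If you want to check that the declaration has the intended effect, a small sketch like the following can help. The names is_aligned and AlignedBuffers are hypothetical, and the check assumes the platform provides std::uintptr_t.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true if p sits on an `alignment`-byte boundary.
inline bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Two 8-byte buffers, each explicitly aligned to 8 bytes.
struct AlignedBuffers {
    alignas(8) char a[8];
    alignas(8) char b[8];
};

// The struct inherits the strictest alignment of its members.
static_assert(alignof(AlignedBuffers) >= 8, "members request 8-byte alignment");

// Runtime check on a concrete object.
inline void demo_alignment() {
    AlignedBuffers buffers{};
    assert(is_aligned(buffers.a, 8));
    assert(is_aligned(buffers.b, 8));
}
```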

From cppreference:

Copies count bytes from the object pointed to by src to the object pointed to by dest. Both objects are reinterpreted as arrays of unsigned char.

Notes

std::memcpy is meant to be the fastest library routine for memory-to-memory copy. It is usually more efficient than std::strcpy, which must scan the data it copies, or std::memmove, which must take precautions to handle overlapping inputs.

Several C++ compilers transform suitable memory-copying loops to std::memcpy calls.

Where strict aliasing prohibits examining the same memory as values of two different types, std::memcpy may be used to convert the values.

So it should be the quickest way to copy data. Be aware, however, that there are several cases where the behavior is undefined:

If the objects overlap, the behavior is undefined.

If either dest or src is a null pointer, the behavior is undefined, even if count is zero.

If the objects are potentially-overlapping or not TriviallyCopyable, the behavior of memcpy is not specified and may be undefined.
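As an illustration of the overlap rule, here is a small sketch where source and destination ranges overlap, so memmove is the right call and memcpy would be undefined. The function name shift_right_by_one is made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: shift the first n bytes of buf one position to the right.
// The source range [buf, buf + n) and the destination range [buf + 1, buf + 1 + n)
// overlap, so std::memmove is required here; std::memcpy would be undefined behavior.
// The caller must make sure buf has room for n + 1 bytes.
inline void shift_right_by_one(char* buf, std::size_t n) {
    std::memmove(buf + 1, buf, n);
}

// "abcdef" becomes "aabcde": bytes 0..4 are moved to positions 1..5.
inline void demo_overlap() {
    char text[] = "abcdef";
    shift_right_by_one(text, 5);
    assert(std::string(text) == "aabcde");
}
```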

Does strcpy/strncpy copy the data byte by byte or in another, more efficient way?

Neither the C++ nor the C standard specifies exactly how strcpy/strncpy are implemented. They only describe the behaviour.

There are multiple standard library implementations, and each chooses how to implement its functions. It is possible to implement both of them using memcpy. The standards don't describe the implementation of memcpy either, and the existence of multiple implementations applies to it just as well.

memcpy can be implemented to take advantage of full-word copies. A short pseudocode sketch of how memcpy could be implemented:
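For example, strncpy semantics could be layered on top of memcpy roughly like this. This is only a sketch under the name strncpy_via_memcpy, which is made up here, and it is not taken from any particular library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical sketch of strncpy built from memchr, memcpy and memset.
// Like strncpy, it copies at most count bytes, pads the rest of dest with
// '\0' if src is shorter, and does NOT terminate dest if src is longer.
inline char* strncpy_via_memcpy(char* dest, const char* src, std::size_t count) {
    // Look for the terminator within the first count bytes of src.
    const void* nul = std::memchr(src, '\0', count);
    const std::size_t len =
        nul ? static_cast<std::size_t>(static_cast<const char*>(nul) - src) : count;

    std::memcpy(dest, src, len);                 // bulk copy of the string body
    std::memset(dest + len, '\0', count - len);  // zero padding, possibly empty
    return dest;
}

// A short source is padded with zeros up to count bytes.
inline void demo_strncpy() {
    char out[8];
    std::memset(out, 'x', sizeof out);
    strncpy_via_memcpy(out, "abc", sizeof out);
    assert(std::memcmp(out, "abc\0\0\0\0\0", sizeof out) == 0);
}
```

The point of the split is that the scan for the terminator and the actual copy become two separate bulk operations, each of which a library can optimize on its own.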

if len >= 2 * word size
    copy bytes until destination pointer is aligned to word boundary
    if len >= page size
        copy entire pages using virtual address manipulation
    copy entire words
copy the trailing bytes that are not aligned to word boundary
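The pseudocode above could be turned into C++ roughly as follows. This is an illustration under a made-up name, wordwise_copy, not the code of any real libc. It leaves out the page-level step, which is OS-specific, and it assumes std::uintptr_t is a reasonable stand-in for the machine word.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical word-at-a-time copy for non-overlapping ranges.
inline void* wordwise_copy(void* dest, const void* src, std::size_t len) {
    auto* d = static_cast<unsigned char*>(dest);
    auto* s = static_cast<const unsigned char*>(src);
    constexpr std::size_t word = sizeof(std::uintptr_t);

    if (len >= 2 * word) {
        // Head: copy single bytes until the destination is word aligned.
        // At most word - 1 bytes are copied, so len cannot underflow.
        while (reinterpret_cast<std::uintptr_t>(d) % word != 0) {
            *d++ = *s++;
            --len;
        }
        // Body: copy whole words. The source may still be unaligned, so each
        // word goes through a fixed-size memcpy, which compilers typically
        // lower to a plain load and store.
        while (len >= word) {
            std::uintptr_t chunk;
            std::memcpy(&chunk, s, word);
            std::memcpy(d, &chunk, word);
            d += word;
            s += word;
            len -= word;
        }
    }
    // Tail: copy the remaining bytes one at a time.
    while (len > 0) {
        *d++ = *s++;
        --len;
    }
    return dest;
}

// Copy between deliberately misaligned offsets and compare the result.
inline void demo_wordwise() {
    unsigned char src[64];
    unsigned char dst[64] = {};
    for (std::size_t i = 0; i < sizeof src; ++i) src[i] = static_cast<unsigned char>(i + 1);
    wordwise_copy(dst + 1, src + 3, 50);
    assert(std::memcmp(dst + 1, src + 3, 50) == 0);
}
```

Real implementations add many more cases (vector registers, size-specific paths), so treat this only as a reading aid for the pseudocode.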

To find out how a particular standard library implementation implements strcpy/strncpy/memcpy, you can read the source code of the standard library, if you have access to it.

Furthermore, when the length is known at compile time, the compiler might even choose not to use the library memcpy and instead do the copy inline. Whether your compiler has built-in definitions for standard library functions can be found in the documentation of the respective compiler.

It depends on the compiler and the C run-time library you are using. In most cases string.h functions like memcmp, memcpy, strcpy, memset, etc. are implemented in assembly in a CPU-optimized way.

You can find the GNU libc implementations of those functions for the AMD64 architecture. As you can see, it may use SSE or AVX instructions to copy between 128 and 512 bits per iteration. Microsoft also bundles the source code of its CRT with Visual Studio (mostly the same approaches; MMX, SSE, and AVX loops are supported).

The compiler also applies special optimizations to such functions; GCC calls them builtins, other compilers call them intrinsics. That is, the compiler may choose to call a library function or to generate CPU-specific assembly code that is optimal for the current context. For example, when the N argument of memcpy is a constant, i.e. memcpy(dst, src, 128), the compiler may generate inline assembly code (something like mov 16,rcx; cld; rep movsq), and when it is a variable, i.e. memcpy(dst, src, bytes), the compiler may insert a call to the library function (something like call _memcpy).
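To see this for yourself, you could compile a pair of functions like the following with something like g++ -O2 -S and compare the output. The names copy_fixed and copy_variable are made up for this sketch, and what the compiler actually emits for each depends on the compiler, its version, and the target CPU.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Size is a compile-time constant: a candidate for an inline expansion.
void copy_fixed(char* dst, const char* src) {
    std::memcpy(dst, src, 128);
}

// Size is only known at run time: the compiler may emit a call to the
// library memcpy instead.
void copy_variable(char* dst, const char* src, std::size_t bytes) {
    std::memcpy(dst, src, bytes);
}

// Both functions must of course produce the same bytes.
void demo_fixed_vs_variable() {
    char src[128];
    char a[128] = {};
    char b[128] = {};
    std::memset(src, 0x5A, sizeof src);
    copy_fixed(a, src);
    copy_variable(b, src, sizeof b);
    assert(std::memcmp(a, b, sizeof a) == 0);
}
```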

I think all of the opinions and advice on this page are reasonable, but I decided to try a little experiment.

To my surprise, the fastest method isn't the one we expected theoretically.

I tried the following code.

#include <cstdlib>  // EXIT_SUCCESS
#include <cstring>
#include <iostream>
#include <string>
#include <chrono>

using std::string;
using std::chrono::system_clock;

inline void mycopy( double* a, double* b, size_t s ) {
   while ( s > 0 ) {
      *a++ = *b++;
      --s;
   }
}

// to make sure that every bit has been changed
bool assertAllTrue( unsigned char* a, size_t s ) {
   unsigned char v = 0xFF;
   while ( s > 0 ) {
      v &= *a++;
      --s;
   }
   return v == 0xFF;
}

int main( int argc, char** argv ) {
   alignas( 16 ) char bufA[512], bufB[512];
   memset( bufB, 0xFF, 512 );  // to prevent strncpy from stopping prematurely
   system_clock::time_point startT;

   memset( bufA, 0, sizeof( bufA ) );
   startT = system_clock::now();
   for ( int i = 0; i < 1024 * 1024; ++i )
      strncpy( bufA, bufB, sizeof( bufA ) );
   std::cout << "strncpy:" << ( system_clock::now() - startT ).count()
             << ", AllTrue:" << std::boolalpha
             << assertAllTrue( ( unsigned char* )bufA, sizeof( bufA ) )
             << std::endl;

   memset( bufA, 0, sizeof( bufA ) );
   startT = system_clock::now();
   for ( int i = 0; i < 1024 * 1024; ++i )
      memcpy( bufA, bufB, sizeof( bufA ) );
   std::cout << "memcpy:" << ( system_clock::now() - startT ).count()
             << ", AllTrue:" << std::boolalpha
             << assertAllTrue( ( unsigned char* )bufA, sizeof( bufA ) )
             << std::endl;

   memset( bufA, 0, sizeof( bufA ) );
   startT = system_clock::now();
   for ( int i = 0; i < 1024 * 1024; ++i )
      memmove( bufA, bufB, sizeof( bufA ) );
   std::cout << "memmove:" << ( system_clock::now() - startT ).count()
             << ", AllTrue:" << std::boolalpha
             << assertAllTrue( ( unsigned char* )bufA, sizeof( bufA ) )
             << std::endl;

   memset( bufA, 0, sizeof( bufA ) );
   startT = system_clock::now();
   for ( int i = 0; i < 1024 * 1024; ++i )
      mycopy( ( double* )bufA, ( double* )bufB, sizeof( bufA ) / sizeof( double ) );
   std::cout << "mycopy:" << ( system_clock::now() - startT ).count()
             << ", AllTrue:" << std::boolalpha
             << assertAllTrue( ( unsigned char* )bufA, sizeof( bufA ) )
             << std::endl;

   return EXIT_SUCCESS;
}

The result (one of many similar results):

strncpy:52840919, AllTrue:true

memcpy:57630499, AllTrue:true

memmove:57536472, AllTrue:true

mycopy:57577863, AllTrue:true

It looks like:

  1. memcpy, memmove, and my own method have similar results;
  2. What magic does strncpy do, so that it is the fastest, even faster than memcpy?

Funny, isn't it?
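Differences this small can easily come from measurement noise or from the optimizer hoisting or merging the repeated copies, so it may be worth repeating the test with a slightly more defensive harness. The sketch below is one possible shape, not a rigorous benchmark: time_copies is a made-up name, it uses steady_clock instead of system_clock, changes one source byte per iteration, and reads one destination byte into a volatile so the copies have an observable effect.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstring>

// Hypothetical timing helper: runs copy(dst, src, 512) `iterations` times
// and returns the elapsed time measured with a monotonic clock.
template <typename CopyFn>
std::chrono::nanoseconds time_copies(CopyFn copy, std::size_t iterations) {
    alignas(16) char src[512];
    alignas(16) char dst[512];
    std::memset(src, 0xFF, sizeof src);
    std::memset(dst, 0, sizeof dst);
    volatile unsigned char sink = 0;

    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iterations; ++i) {
        // Vary the input (never writing zero) so iterations are not identical.
        src[i % sizeof src] = static_cast<char>(1 + i % 255);
        copy(dst, src, sizeof dst);
        // Observable use of the result, to discourage dead-code elimination.
        sink = static_cast<unsigned char>(dst[i % sizeof dst]);
    }
    const auto stop = std::chrono::steady_clock::now();
    (void)sink;
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start);
}
```

It could be called as time_copies([](char* d, const char* s, std::size_t n) { std::memcpy(d, s, n); }, 1024 * 1024), and likewise with strncpy or memmove. Even then, run each variant several times and in different orders before reading anything into a gap of a few percent.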

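Notice: technical posts on this site are published under the CC BY-SA 4.0 license. If you repost them, please credit this site's URL or the original address. For any questions, contact: yoyou2525@163.com.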

 