Optimizing pointer copies in C++

So today I was trying to optimize linked list traversal. My thought was that it's less efficient to copy cur to last and then next to cur, when I could just do one copy. Hopefully the code below helps make it clearer:

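#include <cstddef>   // NULL
#include <ctime>     // clock()
#include <iostream>  // cout, endl

using namespace std;

struct Node{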
    int body;
    Node* next;
};

Node* construct(int len){
    Node *head, *ptr, *end;
    head = new Node();
    ptr = head;
    ptr->body = 0;
    for(int i=1; i<len; i++){
        end = new Node();
        end->next = NULL;
        end->body = i;

        ptr->next = end;
        ptr = end;
    }
    return head;
}

int len(Node* ptr){
    int i=1;
    while(ptr->next){
        ptr = ptr->next;
        i += 1;
    }
    return i;
}

void trim(Node* head){
    Node *last, *cur;
    cur = head;
    while(cur->next){
        last = cur;
        cur = cur->next;
    }
    last->next = NULL;
}

void tumble_trim(Node* head){ // This one uses less copies per traverse
    Node *a, *b;
    a = head;
    while(true){
        if(!a->next){
            b->next = NULL;
            break;
        }
        b = a->next;
        if(!b->next){
            a->next = NULL;
            break;
        }
        a = b->next;
    }
}

int main(){
    clock_t start;  // clock() returns clock_t, not int
    Node *head;

    start = clock();
    head = construct(100000);
    for(int i=0; i<5000; i++){
        trim(head);
    }
    cout << clock()-start << endl;

    start = clock();
    head = construct(100000);
    for(int i=0; i<5000; i++){
        tumble_trim(head);
    }
    cout << clock()-start << endl;
}

The results, however, were quite surprising to me. In fact, the one with fewer copies was slower:

1950000
2310000 // I expected this one to be faster

Can anyone explain why the tumble_trim() function is so slow?

Your compiler is obviously optimising trim() much more than it can tumble_trim(). It's a prime example of keeping your code simple and readable, and only trying any optimisation after you've identified a bottleneck through performance analysis. And even then you'll be hard pressed to beat the compiler on a simple loop like this.

Here are the relevant parts of the generated assembly for the two functions (just the while loops):

trim:

LBB2_1:                                 ## =>This Inner Loop Header: Depth=1
    movq    %rcx, %rax
    movq    %rdi, %rcx
    movq    8(%rdi), %rdi
    testq   %rdi, %rdi
    jne LBB2_1
## BB#2:

tumble_trim:

LBB3_1:                                 ## =>This Inner Loop Header: Depth=1
    movq    %rdi, %rax
    movq    8(%rax), %rdx
    testq   %rdx, %rdx
    je  LBB3_2
## BB#3:                                ##   in Loop: Header=BB3_1 Depth=1
    movq    8(%rdx), %rdi
    testq   %rdi, %rdi
    movq    %rdx, %rcx
    jne LBB3_1
## BB#4:
    movq    $0, 8(%rax)
    popq    %rbp
    ret
LBB3_2:

Now, let's try to describe what happens in each:

In trim, the following steps are performed:

  1. copy 3 pointer-sized values
  2. test the condition for the while loop
  3. if the condition is satisfied, jump to the beginning of the loop

In other words, each iteration contains 3 copies, 1 test and 1 jump instruction.

Now, your clever optimized tumble_trim:

  1. copy 2 pointer-sized values
  2. test the condition for the break
  3. if the condition is satisfied, jump to the end of the loop
  4. else copy a pointer-sized value
  5. test the condition for the while loop
  6. copy a pointer-sized value
  7. jump to the beginning of the loop

In other words, in the final iteration, when you exit the loop, the total number of instructions executed is:

  • trim: 3 pointer copies, 1 compare
  • tumble_trim: 2 pointer copies, 1 compare, 1 jump

In all other iterations, the total count looks as follows:

  • trim: 3 pointer copies, 1 compare, 1 jump
  • tumble_trim: 4 pointer copies, 2 compares, 1 jump

So in the rare case (the last iteration before exiting the loop), your implementation is cheaper if and only if a jump instruction is cheaper than copying a pointer-sized value from register to register (which it is not).

In the common case (all other iterations), your implementation has more copies and more compares. (And more instructions, putting more load on the instruction cache. And more branch statements, putting more load on the branch cache.)
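If you want to check this on your own machine rather than count instructions, a minimal timing sketch could look like the one below. It is not the code from the question: `time_trims` and `destroy` are hypothetical helpers, both trim functions get a guard for one-node lists and free the node they cut off, and `std::chrono::steady_clock` replaces `clock()`. The numbers will vary with compiler, flags and hardware.

```cpp
#include <cassert>
#include <chrono>

struct Node {
    int body;
    Node* next;
};

// Build a list of `len` nodes (len >= 1) holding 0, 1, ..., len-1.
Node* construct(int len) {
    assert(len >= 1);
    Node* head = new Node{0, nullptr};
    Node* tail = head;
    for (int i = 1; i < len; i++) {
        tail->next = new Node{i, nullptr};
        tail = tail->next;
    }
    return head;
}

// Hypothetical helper: free every node of the list.
void destroy(Node* head) {
    while (head) {
        Node* next = head->next;
        delete head;
        head = next;
    }
}

int len(const Node* ptr) {
    int n = 0;
    for (; ptr; ptr = ptr->next) n++;
    return n;
}

// Same traversal as trim() in the question, plus a one-node guard and a delete.
void trim(Node* head) {
    if (!head->next) return;  // nothing to cut off
    Node* last = head;
    Node* cur = head->next;
    while (cur->next) {
        last = cur;
        cur = cur->next;
    }
    delete cur;
    last->next = nullptr;
}

// Same traversal as tumble_trim() in the question, plus a one-node guard and a delete.
void tumble_trim(Node* head) {
    if (!head->next) return;  // the guard also ensures b is set before it is used
    Node* a = head;
    Node* b = nullptr;
    while (true) {
        if (!a->next) {  // a is the tail, b is its predecessor
            delete a;
            b->next = nullptr;
            break;
        }
        b = a->next;
        if (!b->next) {  // b is the tail, a is its predecessor
            delete b;
            a->next = nullptr;
            break;
        }
        a = b->next;
    }
}

// Hypothetical helper: time `rounds` calls of `fn` on a fresh list of `nodes`
// nodes and return the elapsed time in microseconds.
template <typename TrimFn>
long long time_trims(TrimFn fn, int nodes, int rounds) {
    Node* head = construct(nodes);
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < rounds; i++) fn(head);
    auto stop = std::chrono::steady_clock::now();
    destroy(head);
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}
```

Calling `time_trims(trim, 100000, 5000)` and `time_trims(tumble_trim, 100000, 5000)` from `main` reproduces the setup of the question. Build with optimisation enabled, and run each several times before drawing conclusions.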

Now, if you're at all concerned about performance in the first place, then there are two much more fundamental things you're doing wrong:

  1. you are using a linked list. Linked lists are slow because of the algorithm they perform (which involves jumping around in memory, because the nodes are not allocated contiguously), and not because of the implementation. So no matter how clever your implementation is, it would not compensate for the underlying algorithm being terrible.
  2. you are writing your own linked list. If you absolutely must use a linked list, use the one that was written by experts (std::list).
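As a rough illustration of the second point, here is a sketch of the same "drop the last element" operation on standard containers. The names `construct_list` and `trim` are hypothetical, chosen to mirror the question. The `std::vector` overload is my own addition, to show a contiguous container next to the `std::list` the answer names.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <numeric>
#include <vector>

// Build a std::list holding 0, 1, ..., len-1 (mirrors construct() in the question).
std::list<int> construct_list(int len) {
    assert(len >= 0);
    std::list<int> values(static_cast<std::size_t>(len));
    std::iota(values.begin(), values.end(), 0);
    return values;
}

// Drop the last element, if any. std::list is doubly linked and tracks its
// tail, so no traversal is needed.
void trim(std::list<int>& values) {
    if (!values.empty()) values.pop_back();
}

// The same operation on a contiguous container.
void trim(std::vector<int>& values) {
    if (!values.empty()) values.pop_back();
}
```

With either container the trim is a constant-time `pop_back()`, so the per-call walk over 100000 nodes that dominates the original benchmark goes away.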
