
Optimizing pointer copies in C++

So today I was trying to optimize linked list traversal. My thought was that it's less efficient to copy cur to last and then next to cur on every step, when I could get away with just one copy. Hopefully the code below makes it clearer:

#include <iostream>
#include <ctime>
using namespace std;

struct Node{
    int body;
    Node* next;
};

Node* construct(int len){
    Node *head, *ptr, *end;
    head = new Node();
    ptr = head;
    ptr->body = 0;
    for(int i=1; i<len; i++){
        end = new Node();
        end->next = NULL;
        end->body = i;

        ptr->next = end;
        ptr = end;
    }
    return head;
}

int len(Node* ptr){
    int i=1;
    while(ptr->next){
        ptr = ptr->next;
        i += 1;
    }
    return i;
}

void trim(Node* head){
    Node *last, *cur;
    cur = head;
    while(cur->next){
        last = cur;
        cur = cur->next;
    }
    last->next = NULL;
}

void tumble_trim(Node* head){ // This one uses fewer copies per traversal
    Node *a, *b;
    a = head;
    while(true){
        if(!a->next){
            b->next = NULL;
            break;
        }
        b = a->next;
        if(!b->next){
            a->next = NULL;
            break;
        }
        a = b->next;
    }
}

int main(){
    clock_t start;
    Node *head;

    start = clock();
    head = construct(100000);
    for(int i=0; i<5000; i++){
        trim(head);
    }
    cout << clock()-start << endl;

    start = clock();
    head = construct(100000);
    for(int i=0; i<5000; i++){
        tumble_trim(head);
    }
    cout << clock()-start << endl;
}

The results, however, were quite surprising to me. In fact, the one with fewer copies was slower:

1950000
2310000 // I expected this one to be faster

Can anyone explain why the tumble_trim() function is so slow?

Your compiler is obviously optimising trim() much more aggressively than it can tumble_trim(). It's a prime example of why you should keep your code simple and readable, and only attempt optimisation after you've identified a bottleneck through performance analysis. And even then you'll be hard pressed to beat the compiler on a simple loop like this.

Here are the relevant parts of the generated assembly for the two functions (just the while loops):

trim:

LBB2_1:                                 ## =>This Inner Loop Header: Depth=1
    movq    %rcx, %rax
    movq    %rdi, %rcx
    movq    8(%rdi), %rdi
    testq   %rdi, %rdi
    jne LBB2_1
## BB#2:

tumbletrim:

LBB3_1:                                 ## =>This Inner Loop Header: Depth=1
    movq    %rdi, %rax
    movq    8(%rax), %rdx
    testq   %rdx, %rdx
    je  LBB3_2
## BB#3:                                ##   in Loop: Header=BB3_1 Depth=1
    movq    8(%rdx), %rdi
    testq   %rdi, %rdi
    movq    %rdx, %rcx
    jne LBB3_1
## BB#4:
    movq    $0, 8(%rax)
    popq    %rbp
    ret
LBB3_2:

Now, let's try to describe what happens in each:

In trim, the following steps are performed:

  1. copy 3 pointer-sized values
  2. test the condition for the while loop
  3. if the condition is satisfied, jump to the beginning of the loop

In other words, each iteration contains 3 copies, 1 test and 1 jump instruction.

Now, your clever optimized tumbletrim:

  1. copy 2 pointer-sized values
  2. test the condition for the break
  3. if the condition is satisfied, jump to the end of the loop
  4. else copy a pointer-sized value
  5. test the condition for the while loop
  6. copy a pointer-sized value
  7. jump to the beginning of the loop

In other words, in the final iteration, when you exit the loop, the total number of instructions executed is:

  • trim: 3 pointer copies, 1 compare
  • tumbletrim: 2 pointer copies, 1 compare, 1 jump

In all other iterations, the total count looks as follows:

  • trim: 3 pointer copies, 1 compare, 1 jump
  • tumbletrim: 4 pointer copies, 2 compares, 1 jump

So in the rare case (the last iteration before exiting the loop), your implementation is cheaper if and only if a jump instruction is cheaper than copying a pointer-sized value from register to register (which it is not).

In the common case (all other iterations), your implementation has more copies and more compares. (And more instructions, putting more load on the instruction cache. And more branches, putting more load on the branch predictor.)
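For comparison, the copy count can also be brought down at the source level without the duplicated exit tests, by looking two links ahead with a single cursor. This is only a sketch (trim2 is a name invented here, not from the post), and like the original trim() it assumes the list has at least two nodes:

```cpp
#include <cstddef>

struct Node{            // same layout as in the question
    int body;
    Node* next;
};

// Hypothetical single-cursor variant: each iteration is one pointer
// copy, one test and one jump -- the same shape the compiler already
// generated for trim(). Assumes the list has at least two nodes.
void trim2(Node* head){
    Node* cur = head;
    while(cur->next->next)   // stop at the node before the tail
        cur = cur->next;
    delete cur->next;        // note: the question's versions leak this node
    cur->next = NULL;
}
```

Whether this actually beats the compiler's output for trim() is something only a measurement can tell you.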

Now, if you're at all concerned about performance in the first place, then there are two much more fundamental things you're doing wrong:

  1. you are using a linked list. Linked lists are slow because of the memory-access pattern they impose (jumping around in memory, because the nodes are not allocated contiguously), not because of the implementation. So no matter how clever your implementation is, it cannot compensate for the underlying data structure being a poor fit
  2. you are writing your own linked list. If you absolutely must use a linked list, use one that was written by experts (std::list, or std::forward_list if you want a singly linked one)
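To illustrate the second point, here is roughly what trimming the tail looks like with a standard container. The helper name trim_tail is invented for this sketch, and like the question's trim() it assumes the list has at least two elements:

```cpp
#include <forward_list>
#include <iterator>

// std::forward_list matches the question's singly linked layout.
// Dropping the tail still needs a traversal, but all of the node
// management (allocation, unlinking, freeing) is handled by the library.
void trim_tail(std::forward_list<int>& fl){
    auto it = fl.before_begin();
    while(std::next(it, 2) != fl.end())  // stop at the node before the tail
        ++it;
    fl.erase_after(it);                  // unlink and free the tail node
}
```

With std::list the whole operation collapses to lst.pop_back(), because a doubly linked list reaches its tail in O(1); that alone would make both trim() and tumble_trim() irrelevant.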
