So today I was trying to optimize linked list traversal. My thought was that it's less efficient to copy cur into last and then next into cur, when I could get away with a single copy per step. Hopefully the code below makes it clearer:
#include &lt;ctime&gt;     // clock()
#include &lt;iostream&gt;  // cout, endl
using namespace std;

struct Node{
    int body;
    Node* next;
};

Node* construct(int len){
    Node *head, *ptr, *end;
    head = new Node();   // value-initialized, so head->next starts as NULL
    ptr = head;
    ptr->body = 0;
    for(int i=1; i<len; i++){
        end = new Node();
        end->next = NULL;
        end->body = i;
        ptr->next = end;
        ptr = end;
    }
    return head;
}
int len(Node* ptr){
    int i=1;
    while(ptr->next){
        ptr = ptr->next;
        i += 1;
    }
    return i;
}
void trim(Node* head){
    Node *last, *cur;
    cur = head;
    while(cur->next){
        last = cur;
        cur = cur->next;
    }
    last->next = NULL;
}
void tumble_trim(Node* head){ // this one uses fewer copies per traversal
    Node *a, *b;
    a = head;
    while(true){
        if(!a->next){
            b->next = NULL;
            break;
        }
        b = a->next;
        if(!b->next){
            a->next = NULL;
            break;
        }
        a = b->next;
    }
}
int main(){
    clock_t start;
    Node *head;

    start = clock();
    head = construct(100000);
    for(int i=0; i<5000; i++){
        trim(head);
    }
    cout << clock()-start << endl;

    start = clock();
    head = construct(100000);
    for(int i=0; i<5000; i++){
        tumble_trim(head);
    }
    cout << clock()-start << endl;
}
The results, however, were quite surprising to me. In fact, the one with fewer copies was slower:
1950000
2310000 // I expected this one to be faster
Can anyone explain why the tumble_trim() function is so slow?
Your compiler is obviously optimising trim() much more than it can tumble_trim(). It's a prime example of keeping your code simple and readable, and only attempting any optimisation after you've identified a bottleneck through performance analysis. And even then, you'll be hard pressed to beat the compiler on a simple loop like this.
Here are the relevant parts of the generated assembly for the two functions (just the while loops):
trim:
LBB2_1: ## =>This Inner Loop Header: Depth=1
movq %rcx, %rax
movq %rdi, %rcx
movq 8(%rdi), %rdi
testq %rdi, %rdi
jne LBB2_1
## BB#2:
tumbletrim:
LBB3_1: ## =>This Inner Loop Header: Depth=1
movq %rdi, %rax
movq 8(%rax), %rdx
testq %rdx, %rdx
je LBB3_2
## BB#3: ## in Loop: Header=BB3_1 Depth=1
movq 8(%rdx), %rdi
testq %rdi, %rdi
movq %rdx, %rcx
jne LBB3_1
## BB#4:
movq $0, 8(%rax)
popq %rbp
ret
LBB3_2:
Now, let's try to describe what happens in each:
In trim, the following steps are performed on each iteration:

    movq %rcx, %rax        ; copy
    movq %rdi, %rcx        ; copy (last = cur)
    movq 8(%rdi), %rdi     ; copy (cur = cur->next)
    testq %rdi, %rdi       ; test (while(cur->next))
    jne LBB2_1             ; jump

In other words, each iteration contains 3 copies, 1 test and 1 jump instruction.
Now, your clever optimized tumbletrim. A full iteration of its loop looks like this:

    movq %rdi, %rax        ; copy
    movq 8(%rax), %rdx     ; copy (b = a->next)
    testq %rdx, %rdx       ; test (if(!a->next))
    je LBB3_2              ; jump
    movq 8(%rdx), %rdi     ; copy (a = b->next)
    testq %rdi, %rdi       ; test (if(!b->next))
    movq %rdx, %rcx        ; copy
    jne LBB3_1             ; jump

In other words, in the final iteration, when you exit the loop at the first test, the total number of instructions executed is: 2 copies, 1 test and 1 jump. In all other iterations, the total count looks as follows: 4 copies, 2 tests and 2 jumps.
So in the rare case (the last iteration before exiting the loop), your implementation is cheaper if and only if a jump instruction is cheaper than copying a pointer-sized value from register to register (which it is not). In the common case (all other iterations), your implementation has more copies and more compares. (And more instructions, putting more load on the instruction cache; and more branches, putting more load on the branch predictor.)
Now, if you're at all concerned about performance in the first place, then there are two much more fundamental things you're doing wrong, chief among them: you're hand-rolling your own linked list instead of using the standard container (std::list).