
Python vs perl sort performance

Solution

This solved all issues with my Perl code (plus some extra implementation code... :-) ). In conclusion, both Perl and Python are equally awesome.

use WWW::Curl::Easy;

Thanks to ALL who responded, very much appreciated.

Edit

It appears that the Perl code I am using spends the majority of its time performing the HTTP GET. For example:

my $start_time = gettimeofday;
$request = HTTP::Request->new('GET', 'http://localhost:8080/data.json');
$response = $ua->request($request);
$page = $response->content;
my $end_time = gettimeofday;
print "Time taken @{[ $end_time - $start_time ]} seconds.\n";

The result is:

Time taken 74.2324419021606 seconds.

My Python code, in comparison:

start = time.time()
r = requests.get('http://localhost:8080/data.json', timeout=120, stream=False)

maxsize = 100000000
content = ''
for chunk in r.iter_content(2048):
    content += chunk
    if len(content) > maxsize:
        r.close()
        raise ValueError('Response too large')

end = time.time()
timetaken = end-start
print timetaken

The result is:

20.3471381664
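As an aside, the Python loop above builds `content` by repeated string concatenation, which copies the growing string on each iteration; the usual idiom is to collect chunks in a list and join once at the end. A minimal sketch (the helper name and size limit here are made up for illustration):

```python
def read_limited(chunks, maxsize=100_000_000):
    """Accumulate chunks in a list and join once, instead of
    repeated string concatenation (which can grow quadratically)."""
    parts = []
    total = 0
    for chunk in chunks:
        total += len(chunk)
        if total > maxsize:
            raise ValueError('Response too large')
        parts.append(chunk)
    return ''.join(parts)
```

The same size check is preserved; only the accumulation strategy changes.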

In both cases the sort times are sub-second. So first of all, I apologise for the misleading question; it is another lesson for me to never make assumptions... :-)

I'm not sure what the best thing to do with this question is now. Perhaps someone can propose a better way of performing the request in Perl?

End of edit

This is just a quick question regarding sort performance differences in Perl vs Python. This is not a question about which language is better or faster. For the record: I first wrote this in Perl, noticed the time the sort was taking, and then wrote the same thing in Python to see how fast it would be. I simply want to know: how can I make the Perl code perform as fast as the Python code?

Let's say we have the following JSON:

["3434343424335": {
        "key1": 2322,
        "key2": 88232,
        "key3": 83844,
        "key4": 444454,
        "key5": 34343543,
        "key6": 2323232
    },
"78237236343434": {
        "key1": 23676722,
        "key2": 856568232,
        "key3": 838723244,
        "key4": 4434544454,
        "key5": 3432323543,
        "key6": 2323232
    }
]

Let's say we have a list of around 30k-40k records which we want to sort by one of the sub-keys. We then want to build a new array of records ordered by that sub-key.

Perl - Takes around 27 seconds

my @list;
$decoded = decode_json($page);
foreach my $id (sort {$decoded->{$b}->{key5} <=> $decoded->{$a}->{key5}} keys %{$decoded}) {
    push(@list,{"key"=>$id,"key1"=>$decoded->{$id}{key1}...etc));
}

Python - Takes around 6 seconds

list = []
data = json.loads(content)
data2 = sorted(data, key = lambda x: data[x]['key5'], reverse=True)

for key in data2:
    tmp = {'id': key, 'key1': data[key]['key1'], ...}  # etc. for the other keys
    list.append(tmp)

For the perl code, I have tried using the following tweaks:

use sort '_quicksort';  # use a quicksort algorithm
use sort '_mergesort';  # use a mergesort algorithm

Your benchmark is flawed: you're benchmarking multiple variables, not one. It is not just sorting data; it is also decoding JSON, creating strings, and appending to an array. You can't know how much time is spent sorting and how much is spent doing everything else.

The matter is made worse by the fact that Perl has several different JSON implementations, each with its own performance characteristics. Change the underlying JSON library and the benchmark changes again.

If you want to benchmark the sort, you'll have to change your benchmark code so that the cost of loading your test data, JSON or not, is excluded from the measurement.

Perl and Python have their own internal benchmarking libraries that can benchmark individual functions, but their instrumentation can make them perform far less well than they would in the real world. The performance drag from each benchmarking implementation will be different and might introduce a false bias. These benchmarking libraries are more useful for comparing two functions in the same program. For comparing between languages, keep it simple.
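For illustration, comparing two candidate sorts within one Python program is the kind of job `timeit` is suited to. A sketch with made-up data (the record shape mimics the question; both functions produce the same ordering of key values):

```python
import random
import timeit

# Made-up data resembling the question's records.
data = {str(i): {'key5': random.randrange(10**9)} for i in range(1000)}

def plain_sort():
    # Sort ids by the sub-key, as in the question's Python code.
    return sorted(data, key=lambda x: data[x]['key5'], reverse=True)

def precomputed_sort():
    # Decorate-sort-undecorate: sort bare (value, id) tuples instead.
    pairs = sorted(((data[k]['key5'], k) for k in data), reverse=True)
    return [k for _, k in pairs]

for fn in (plain_sort, precomputed_sort):
    print(fn.__name__, timeit.timeit(fn, number=100))
```

Both functions are timed under identical conditions in the same interpreter, which is where such libraries shine; cross-language comparisons are better done with plain wall-clock timing as described below.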

The simplest thing to do to get an accurate benchmark is to time the code within the program using the wall clock.

# The current time to the microsecond.
use Time::HiRes qw(gettimeofday);

my @list;
my $decoded = decode_json($page);

my $start_time = gettimeofday;

foreach my $id (sort {$decoded->{$b}->{key5} <=> $decoded->{$a}->{key5}} keys %{$decoded}) {
    push(@list,{"key"=>$id,"key1"=>$decoded->{$id}{key1}...etc));
}

my $end_time = gettimeofday;

print "sort and append took @{[ $end_time - $start_time ]} seconds\n";

(I leave the Python version as an exercise)
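For completeness, a sketch of what that exercise might look like (`data` is a made-up stand-in for the decoded JSON; only `key1` is copied, like the elided code in the question):

```python
import random
import time

# Stand-in for the decoded JSON from the question (made-up records).
data = {str(i): {'key1': i, 'key5': random.randrange(10**9)}
        for i in range(40000)}

start = time.time()

result = []
for key in sorted(data, key=lambda x: data[x]['key5'], reverse=True):
    result.append({'id': key, 'key1': data[key]['key1']})

end = time.time()
print('sort and append took %f seconds' % (end - start))
```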

From here you can improve your technique. You can use CPU seconds instead of wall-clock time. The array append and the cost of creating the strings are still included in the benchmark; they can be eliminated so that you're benchmarking only the sort. And so on.
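In Python, for example, CPU seconds are available via `time.process_time()`, which excludes time spent blocked on I/O or pre-empted by other processes (a minimal sketch; the busy-work loop is arbitrary):

```python
import time

start = time.process_time()                # CPU seconds, not wall clock
total = sum(i * i for i in range(10**6))   # arbitrary CPU-bound busy work
cpu_seconds = time.process_time() - start
print('used %.3f CPU seconds' % cpu_seconds)
```

Perl's Benchmark module reports CPU time by default, so the two measurements become more directly comparable.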

Additionally, you can use a profiler to find out where your programs are spending their time. Profilers have the same raw-performance caveats as benchmarking libraries: the results are only useful for seeing what percentage of its time a program spends where, but that is enough to quickly reveal whether your benchmark has unexpected drag.

The important thing is to benchmark what you think you're benchmarking.

Something else is at play here; I can run your sort in half a second. Improving that is not going to depend on sorting algorithm so much as reducing the amount of code run per comparison; a Schwartzian Transform gets it to a third of a second, a Guttman-Rosler Transform gets it down to a quarter of a second:

#!/usr/bin/perl
use 5.014;
use warnings;

my $decoded = { map( (int rand 1e9, { map( ("key$_", int rand 1e9), 1..6 ) } ), 1..40000 ) };

use Benchmark 'timethese';

timethese( -5, {
    'original' => sub {
        my @list;
        foreach my $id (sort {$decoded->{$b}->{key5} <=> $decoded->{$a}->{key5}} keys %{$decoded}) {
            push(@list,{"key"=>$id,%{$decoded->{$id}}});
        }
    },
    'st' => sub {
        my @list;
        foreach my $id (
            map $_->[1],
            sort { $b->[0] <=> $a->[0] }
            map [ $decoded->{$_}{key5}, $_ ],
            keys %{$decoded}
        ) {
            push(@list,{"key"=>$id,%{$decoded->{$id}}});
        }
    },
    'grt' => sub {
        my $maxkeylen=15;
        my @list;
        foreach my $id (
            map substr($_,$maxkeylen),
            sort { $b cmp $a }
            map sprintf('%0*s', $maxkeylen, $decoded->{$_}{key5}) . $_,
            keys %{$decoded}
        ) {
            push(@list,{"key"=>$id,%{$decoded->{$id}}});
        }
    },
});
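In Python, `sorted(..., key=...)` already performs the decorate-sort-undecorate step internally, which is one reason the original Python code is fast; the same transform can also be written out by hand (a sketch with made-up data):

```python
import random

# Made-up records resembling the question's data.
data = {str(i): {'key5': random.randrange(10**9)} for i in range(1000)}

# Schwartzian transform by hand: decorate with the sort key,
# sort the bare tuples, then undecorate.
decorated = [(rec['key5'], key) for key, rec in data.items()]
decorated.sort(reverse=True)
ordered_ids = [key for _, key in decorated]
```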

Don't create a new hash for each record. Just add the key to the existing one.

$decoded->{$_}{key} = $_
   for keys(%$decoded);

my @list = sort { $b->{key5} <=> $a->{key5} } values(%$decoded);

Using Sort::Key will make it even faster.

use Sort::Key qw( rukeysort );

$decoded->{$_}{key} = $_
   for keys(%$decoded);

my @list = rukeysort { $_->{key5} } values(%$decoded);
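The first trick above ports directly to Python as well: annotate each record with its id in place, then sort the records themselves instead of building new dicts (a sketch using the two records from the question's JSON):

```python
# Two records from the question's example JSON.
data = {
    '3434343424335': {'key5': 34343543},
    '78237236343434': {'key5': 3432323543},
}

# Add the id into each existing record instead of creating new dicts.
for key, rec in data.items():
    rec['key'] = key

# Sort the records themselves, descending by the sub-key.
ordered = sorted(data.values(), key=lambda r: r['key5'], reverse=True)
```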
