I have a problem with object comparison in PHP. What seems like a straightforward code actually runs way too slow for my liking and as I am not that advanced in the language I would like some feedback and suggestions regarding the following code:
class TestTokenGroup {
private $tokens;
...
public static function create($tokens) {
$instance = new static();
$instance->tokens = $tokens;
...
return $instance;
}
public function getTokens() {
return $this->tokens;
}
public static function compare($tokenGroup1, $tokenGroup2) {
$i = 0;
$minLength = min(array(count($tokenGroup1->getTokens()), count($tokenGroup2->getTokens())));
$equalLengths = (count($tokenGroup1->getTokens()) == count($tokenGroup2->getTokens()));
$comparison = strcmp($tokenGroup1->getTokens()[$i], $tokenGroup2->getTokens()[$i]);
while ($comparison == 0) {
$i++;
if (($i == $minLength) && ($equalLengths == true)) {
return 0;
}
$comparison = strcmp($tokenGroup1->getTokens()[$i], $tokenGroup2->getTokens()[$i]);
}
$result = $comparison;
if ($result < 0)
return -1;
elseif ($result > 0)
return 1;
else
return 0;
}
...
}
In the code above $tokens
is just a simple array of strings.
Using the method above through usort()
for an array of TestTokenGroup
consisting of around 40k objects takes ~2secs.
Is there a sensible way to speed that up? Where is the bottleneck here?
EDIT: Added the getTokens() method I initially forgot to include.
You know that objects are "pass by reference", and arrays are "pass by value"?
If getTokens()
returns $this->tokens
, the array is copied every time you invoke that method.
Try accessing $tokens
directly via $tokenGroup1->tokens
. You could also use references ( &
) although returning a reference doesn't work in all PHP versions.
Alternatively, make one copy only:
$tokens1 = $tokenGroup1->getTokens();
$tokens2 = $tokenGroup2->getTokens();
Even if each token group is relatively small, it will save at least 40000 * ( 6 + $average_token_group_length * 2)
array copies.
UPDATE
I've benchmarked OP's code (removing the ...
lines) using:
function gentokens() {
$ret = [];
for ( $i=0; $i< 3; $i++)
{
$str = "";
for ( $x = rand(0,3); $x < 10; $x ++ )
$str .= chr( rand(0,25) + ord('a') );
$ret[] = $str;
}
return $ret;
}
$start = microtime(true);
$array = []; // this will hold the TestTokenGroup instances
$dummy = ""; // this will hold the tokens, space-separated and newline-separated
$dummy2= []; // this will hold the space-concatenated strings
for ( $i=0; $i < 40000; $i++)
{
$array[] = TestTokenGroup::create( $t = gentokens() );
$dummy .= implode(' ', $t ) . "\n";
$dummy2[] = implode(' ', $t );
}
// write a test file to benchmark GNU sort:
file_put_contents("sort-data.txt", $dummy);
$inited = microtime(true);
printf("init: %f s\n", ($inited-$start));
usort( $array, [ 'TestTokenGroup', 'compare'] );
$sorted = microtime(true);
printf("sort: %f s\n", ($sorted-$inited));
usort( $dummy2, 'strcmp' );
$sorted2 = microtime(true);
printf("sort: %f s\n", ($sorted2-$sorted));
With the following results:
init: 0.359329 s // for generating 40000 * 3 random strings and setup
sort: 1.012096 s // for the TestTokenGroup::compare
sort: 0.120583 s // for the 'strcmp' compare
And, running time sort sort-data.txt > /dev/null
yields
.052 u (user-time, in seconds).
optimisation 1: remove array copies
replacing ->getTokens()
with ->tokens
yields (I'll only list the TestTokenGroup::compare
results):
sort: 0.832794 s
Optimisation 2: remove redundant array()
in min
Changing the $minlength
line to:
$minLength = min(count($tokenGroup1->tokens), count($tokenGroup2->tokens));
gives
sort: 0.779134 s
Optimisation 3: Only call count
once for each tokenGroup
$count1 = count($tokenGroup1->tokens);
$count2 = count($tokenGroup2->tokens);
$minLength = min($count1, $count2);
$equalLengths = ($count1 == $count2);
gives
sort: 0.679649 s
Alternative approach
The fastest sort so far is strcmp( $stringarray, 'strcmp' )
: 0.12s - still twice as slow as GNU sort, but the latter only does one thing, and does it well.
So, to sort the TokenGroups efficiently we need to construct sort key consisting of a simple string. We can use \\0
as a delimiter for the tokens, and we don't have to worry about them being equal length, because as soon as one character is different, the compare aborts.
Here's the implementation:
$arr2 = [];
foreach ( $array as $o )
$arr2[ implode("\0", $o->getTokens() ) ] = $o;
$init2 = microtime(true);
printf("init2: %f s\n", ($init2-$sorted2));
uksort( $arr2, 'strcmp' );
$sorted3 = microtime(true);
printf("sort: %f s\n", ($sorted3-$init2));
and here the results:
init2: 0.125939 s
sort: 0.104717 s
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.