IMPORTANT: Before you suggest that we use a readily available load balancer, please understand that we're not trying to load balance normal internet traffic. We receive data from a number of sources, some of which aren't internet connected (i.e. they may use SMS or similar). We process these and then forward the messages on to the next stage. It is this internal step that we need to load balance. It does not use HTTP requests.
Now the scenario:
I'm testing various hashing algorithms in PHP to use for load balancing. Every message carries the device's serial number, which is constant, so I'm planning to hash that value. It is critical that all messages from a given device are always load balanced to the same node. We are not interested in node capacity at this point.
So, I've read a few articles about using the modulus of a hash to determine which node to direct to. I've seen some stating that the decimal representation of a hash from MD5, SHA1, SHA256 and SHA512 will all exceed PHP_INT_MAX, and that the modulus will therefore always be zero, so we can't use that to load balance.
I've also seen suggestions that we could simply bit shift away a chunk of the hash and use only the high bits to get the modulus and again, use that.
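One way to sidestep the overflow concern is to reduce only a short prefix of the hex digest, so the value being divided always fits in a native integer. The following is a minimal sketch of that idea, not a recommendation from any of the articles mentioned. The function name `nodeFromHashPrefix`, the 7-digit prefix length and the example serials are all assumptions made for illustration.

```php
<?php
// Hypothetical helper: maps a serial to a node index using a short hash prefix.
function nodeFromHashPrefix(string $serial, int $nodes, string $algo = 'md5'): int
{
    if ($nodes < 1) {
        throw new InvalidArgumentException('$nodes must be at least 1');
    }
    // hash() returns a lowercase hex string; keep the first 7 hex digits (28 bits).
    $prefix = substr(hash($algo, $serial), 0, 7);
    // 28 bits is at most 268435455, which fits in a PHP int even on 32-bit builds,
    // so hexdec() returns an int here rather than a float.
    return hexdec($prefix) % $nodes;
}

// Example serials are made up for illustration.
foreach (['SN-000123', 'SN-000124', 'SN-000125'] as $serial) {
    echo $serial, ' -> node ', nodeFromHashPrefix($serial, 16), PHP_EOL;
}
```

Because the result depends only on the serial, the algorithm and the node count, the same device maps to the same node for as long as the node count stays fixed.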
This got me wondering... Since we only need to load balance between a relatively small number of nodes to begin with (it won't exceed 16 for some time), is any algorithm good enough that we could just use the first byte and load balance based on that?
So I wrote this really simple function:
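function balanceToNode(string $serial, string $algo="md5", int $nodes=1)
{
    $hash = hash($algo, $serial);
    // $hash is a hex string, so $hash[1] is its second hex digit (4 bits, 0-15),
    // not the whole first byte.
    return hexdec($hash[1]) % $nodes;
}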
I ran this with md5, sha1, sha256 and sha512 against a sample of 1000 real serial numbers from our database. I tested each with 4, 8 and 16 nodes and examined the standard deviation across each of the nodes.
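A test of this kind can be sketched roughly as follows. This is not the exact script used for the figures in this post. The helper names `countPerNode` and `populationStdDev`, the synthetic serials and the md5-prefix balancer are assumptions chosen to keep the example self-contained.

```php
<?php
// Count how many serials each node receives under a given balancing function.
function countPerNode(array $serials, callable $balancer, int $nodes): array
{
    $counts = array_fill(0, $nodes, 0);
    foreach ($serials as $serial) {
        $counts[$balancer($serial, $nodes)]++;
    }
    return $counts;
}

// Population standard deviation: sqrt of the mean squared deviation from the mean.
function populationStdDev(array $values): float
{
    $mean = array_sum($values) / count($values);
    $sumSquares = 0.0;
    foreach ($values as $value) {
        $sumSquares += ($value - $mean) ** 2;
    }
    return sqrt($sumSquares / count($values));
}

// Synthetic serials stand in for real data here.
$serials = [];
for ($i = 0; $i < 1000; $i++) {
    $serials[] = sprintf('SN-%06d', $i);
}

// Assumed balancer: first 7 hex digits of the md5 digest, modulo the node count.
$balancer = function (string $serial, int $nodes): int {
    return hexdec(substr(md5($serial), 0, 7)) % $nodes;
};

foreach ([4, 8, 16] as $nodes) {
    $counts = countPerNode($serials, $balancer, $nodes);
    printf("%2d nodes: stdev %.3f\n", $nodes, populationStdDev($counts));
}
```

A lower standard deviation of the per-node counts means a more even spread for that node count.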
What I found was that using md5 resulted in predictable and well balanced data between the nodes, regardless of how many nodes we used. sha1 balanced poorly if we had a small number of nodes, but balanced better with a larger number of nodes. sha256 was even worse than sha1 but still balanced better with more nodes, and sha512 was almost unaffected by the number of nodes but still didn't balance as well as md5.
Here is the actual data that we got from the tests. We took an estimated standard deviation from the sample to get to the numbers per node, and then the stdev of the population to get a figure for the algorithm consistency. I may be doing the maths wrong! I'm looking for smaller numbers in all cases.
Consistency: 0.356104153
Consistency: 3.91162024
Consistency: 6.040803998
Consistency: 0.846358482
I ran the test again, this time using the first 8 bytes from any given hash. This made a massive difference, although md5 still appears to perform the best. The big surprise for me here is that sha512 performed significantly worse with 16 than with 8.
Consistency: 4.12559407
Consistency: 7.850664268
Consistency: 7.287949408
Consistency: 4.39280188
Bear in mind that we're using the hash algorithm PURELY to distribute messages across a cluster of nodes, and not to secure passwords. Based on the samples, the questions are:
md5
If your serial numbers have uniform random distribution, use
n = serial % N
where n is the node to address and N is the total number of nodes.
With more serials in use, the load may be better balanced by using some middle bits rather than the lowest bits, as this simple formula does. CPU caches often operate this way. But this additional complexity may or may not be worth the effort.
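Both variants can be sketched in PHP as below, assuming the serial is available as a non-negative integer. The function names and the shift of 8 bits are illustrative choices, not something fixed by the formula above.

```php
<?php
// Lowest bits: the formula n = serial % N.
function nodeFromLowBits(int $serial, int $nodes): int
{
    return $serial % $nodes;
}

// Middle bits: discard the lowest $shift bits first, then reduce modulo N.
// $shift = 8 is an arbitrary illustrative choice.
function nodeFromMiddleBits(int $serial, int $nodes, int $shift = 8): int
{
    return ($serial >> $shift) % $nodes;
}

$serial = 1000003; // made-up serial
echo nodeFromLowBits($serial, 16), PHP_EOL;
echo nodeFromMiddleBits($serial, 16), PHP_EOL;
```

Whether either variant spreads load evenly depends on how the serials are actually allocated, so it is worth checking against real data before relying on it.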