Hash algorithms for load balancing

IMPORTANT: Before you suggest that we use a readily available load balancer, please understand that we're not trying to load balance normal internet traffic. We receive data from a number of sources, some of which aren't internet connected (i.e. they may use SMS or similar). We process these messages and then forward them on to the next stage. It is this internal step that we need to load balance, and it does not use HTTP requests.

Now the scenario:

I'm doing some testing on various hashing algorithms in PHP to use for load balancing. I need to guarantee that every device is load balanced to the same node. In our use case, every message has a serial number which is constant, so I'm planning on using this value. It is critical that all messages from a given device are load balanced to the same node. We are not interested in the node capacity at this point.

So, I've read a few articles about using the modulus of a hash to determine which node to direct to. Some state that the decimal representation of a hash from MD5, SHA1, SHA256 or SHA512 will always exceed PHP_INT_MAX, so the modulus will always be zero and we can't use it to load balance.

I've also seen suggestions that we could simply bit shift away a chunk of the hash, keep only the high bits, and take the modulus of those.
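A minimal sketch of that idea, assuming a 64-bit PHP build: rather than converting the whole digest to a number, keep only the leading 8 hex characters (32 bits), which always fit in a PHP integer, and take the modulus of that. The function name `nodeFromHashPrefix` and the choice of 8 characters are illustrative assumptions, not something taken from the articles mentioned.

```php
<?php
declare(strict_types=1);

/**
 * Map a serial number to a node index using only the leading bits of its hash.
 * Hypothetical helper; assumes 64-bit PHP so that 32 bits fit in an int.
 */
function nodeFromHashPrefix(string $serial, int $nodes, string $algo = 'md5'): int
{
    if ($nodes < 1) {
        throw new InvalidArgumentException('nodes must be at least 1');
    }
    $hash = hash($algo, $serial);   // hex string, e.g. 32 characters for md5
    $prefix = substr($hash, 0, 8);  // first 8 hex characters = 32 bits
    return ((int) hexdec($prefix)) % $nodes; // always in 0 .. $nodes - 1
}
```

The same serial always yields the same prefix, so it always maps to the same node as long as the node count does not change.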

This got me wondering... Since we only need to load balance between a relatively small number of nodes to begin with (it won't exceed 16 for some time), is any algorithm good enough that we could just use the first byte and load balance based on that?

So I wrote this really simple function:

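function balanceToNode(string $serial, string $algo="md5", int $nodes=1)
{
    // hash() returns the digest as a lowercase hex string
    $hash = hash($algo, $serial);
    // Note: $hash[1] is the second hex digit, i.e. 4 bits (0-15), not a full byte
    return hexdec($hash[1]) % $nodes;
}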
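One detail worth making explicit: `hash()` returns a hex string, so `$hash[1]` is a single hex digit (4 bits, values 0 to 15), and it is the second digit rather than the first. A full first byte is the first two hex characters. A small sketch of the difference, where `firstByteToNode` is a hypothetical name introduced here purely for comparison:

```php
<?php
declare(strict_types=1);

// Hypothetical variant: uses the full first byte (two hex characters, 0..255).
function firstByteToNode(string $serial, string $algo = 'md5', int $nodes = 1): int
{
    $hash = hash($algo, $serial);
    return ((int) hexdec(substr($hash, 0, 2))) % $nodes;
}

$hash = hash('md5', 'abc');           // 900150983cd24fb0d6963f7d28e17f72
$nibble = hexdec($hash[1]);           // second hex digit: '0' -> 0
$byte = hexdec(substr($hash, 0, 2));  // first byte: '90' -> 144
```

With 4, 8 or 16 nodes both variants split their value range evenly. For a node count that does not divide 16 (or 256), some nodes would be assigned more hash values than others.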

I ran this with md5, sha1, sha256 and sha512 against a sample of 1000 real serial numbers from our database. I tested each with 4, 8 and 16 nodes and examined the standard deviation across each of the nodes.

Single Byte Test

What I found was that using md5 resulted in predictable and well balanced data between the nodes, regardless of how many nodes we used. sha1 balanced poorly if we had a small number of nodes, but balanced better with a larger number of nodes. sha256 was even worse than sha1 but still balanced better with more nodes, and sha512 was almost unaffected by the number of nodes but still didn't balance as well as md5.

Here is the actual data that we got from the tests. For each node count, we took the estimated (sample) standard deviation of the numbers per node. The "Consistency" figure for each algorithm is the population standard deviation of its three per-node-count results. I may be doing maths wrong! I'm looking for smaller numbers in all cases.
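For reference, the calculation described above might look roughly like the sketch below. The helper names `countPerNode` and `stdDev` are made up for illustration, and the balancer callback signature is an assumption; the original test harness is not shown in this post.

```php
<?php
declare(strict_types=1);

/** Count how many serials land on each node for a given balancing function. */
function countPerNode(array $serials, callable $balancer, int $nodes): array
{
    $counts = array_fill(0, $nodes, 0);
    foreach ($serials as $serial) {
        $counts[$balancer((string) $serial, $nodes)]++;
    }
    return $counts;
}

/** Standard deviation; $sample = true divides by n - 1, false divides by n. */
function stdDev(array $values, bool $sample): float
{
    $n = count($values);
    if ($n < 2) {
        throw new InvalidArgumentException('need at least two values');
    }
    $mean = array_sum($values) / $n;
    $sumSq = 0.0;
    foreach ($values as $v) {
        $sumSq += ($v - $mean) ** 2;
    }
    return sqrt($sumSq / ($sample ? $n - 1 : $n));
}

// The three MD5 single-byte figures quoted in this post (4, 8 and 16 nodes).
$md5 = [8.041558721, 7.171371656, 7.554248253];
$consistency = stdDev($md5, false); // population standard deviation
```

Under this reading, the population standard deviation of the three MD5 figures comes out at roughly 0.3561, which agrees with the Consistency value reported for MD5.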

MD5

  • 4 nodes: 8.041558721
  • 8 nodes: 7.171371656
  • 16 nodes: 7.554248253

Consistency: 0.356104153

SHA1

  • 4 nodes: 17.53092506
  • 8 nodes: 13.24494513
  • 16 nodes: 7.966596931

Consistency: 3.91162024

SHA256

  • 4 nodes: 25.81988897
  • 8 nodes: 15.7116881
  • 16 nodes: 11.40741718

Consistency: 6.040803998

SHA512

  • 4 nodes: 11.5758369
  • 8 nodes: 10.87592361
  • 16 nodes: 9.535897091

Consistency: 0.846358482

8 Byte Test

I ran the test again, this time using the first 8 bytes from any given hash. This made a massive difference, although md5 still appears to perform the best. The big surprise for me here is that sha512 performed significantly worse with 16 nodes than with 8.

MD5

  • 4 nodes: 18
  • 8 nodes: 13.53302838
  • 16 nodes: 7.916228058

Consistency: 4.12559407

SHA1

  • 4 nodes: 27.41046029
  • 8 nodes: 17.63114128
  • 16 nodes: 8.181279444

Consistency: 7.850664268

SHA256

  • 4 nodes: 25.31139401
  • 8 nodes: 15.25029274
  • 16 nodes: 7.509993342

Consistency: 7.287949408

SHA512

  • 4 nodes: 17.60681686
  • 8 nodes: 6.886840453
  • 16 nodes: 11.44261042

Consistency: 4.39280188

My Actual Question

Bearing in mind that we're using the hash algorithm PURELY to distribute messages across a cluster of nodes, and not to secure passwords, and based on the samples above, my questions are:

  1. Am I safe to use the first byte only?
  2. Is it OK to use md5?
  3. Am I doing maths wrong?

If your serial numbers have uniform random distribution, use

n = serial % N

where n is the node to address and N is the total number of nodes.

With more serials in use, the load may be better balanced by using some middle bits rather than the lowest bits, as this simple formula does. CPU caches often operate this way. But this additional complexity may or may not be worth the effort.
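As a rough PHP sketch of both variants, assuming the serials are non-negative integers (the question does not say so): a shift of 0 gives the plain `serial % N`, while a positive shift discards that many low bits first so that higher bits decide the node. The function name `nodeFromSerial` and the `$shift` parameter are illustrative only.

```php
<?php
declare(strict_types=1);

/**
 * Pick a node directly from a numeric serial, optionally skipping low bits.
 * Assumes non-negative integer serials; $shift = 0 is plain serial % N.
 */
function nodeFromSerial(int $serial, int $nodes, int $shift = 0): int
{
    return ($serial >> $shift) % $nodes;
}
```

For example, `nodeFromSerial(52, 4)` uses the lowest bits of 52, while `nodeFromSerial(52, 4, 2)` first drops the two lowest bits and takes 13 % 4.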

