In a horizontally scaled architecture, data is distributed across multiple servers to ensure proper load distribution. But how do we determine which piece of data is stored on which server? And how do we handle the redistribution of data when servers are added or removed? Let’s find out.
Hashing
Hashing is a mechanism where we use an array of available servers to distribute objects. The key of an object is fed into a hashing function, and the remainder of the result divided by the total number of servers determines the server that will store the object:
server = hash(object_key) % no. of servers
Example
Let’s break it down with an example:
Assume we have 10 servers in total.
We feed our object key into a hash function, and the output is 133.
Divide 133 by 10 to get the remainder, which is 3 (133 % 10 = 3).
Server 3 is selected to hold the object.
This same mechanism can also be used to retrieve the object—just hash the object key again and identify the server.
The Problem
At first glance, this approach seems fine. However, issues arise when the number of servers changes. For instance:
If server 3 goes down, the number of servers becomes 9.
This changes the result of the hash function and leads to a data miss for many objects.
Worse, this requires a large-scale redistribution of data across the remaining servers.
Consistent Hashing
To address this problem, we use consistent hashing, a technique that minimizes data redistribution when servers are added or removed.
How Consistent Hashing Works
Instead of a variable-length array, consistent hashing uses a ring structure with a fixed number of partitions.
Both the object keys and the server identifiers (e.g., IP addresses or domain names) are hashed and placed on the ring.
Starting from the location of an object key on the ring, we move clockwise to find the first server. This server stores the object.
Benefits
When a server is removed, only the data mapped to that server needs redistribution, and it is reassigned to the next server clockwise on the ring.
This significantly reduces the amount of data that needs to be redistributed compared to the basic hashing mechanism.
Adding new servers follows the same principle, with minimal impact on existing data placement.
Example with Consistent Hashing
Suppose we have a ring with 360 partitions (to minimize collisions).
An object key hashes to position 72 on the ring.
Server identifiers are also hashed to positions on the ring, e.g., Server A at 50, Server B at 100, and so on.
Starting from 72, we move clockwise to 100 (Server B), which stores the object.
If Server B is removed, objects from 72 to 100 are reassigned to the next server clockwise.
Conclusion
Consistent hashing is a powerful technique for distributed systems, ensuring efficient data distribution with minimal redistribution. It is widely used in systems like distributed caches, object storage, and content delivery networks to achieve scalability and resilience.