Replication - How Is It Done In Distributed Systems?

Replication simply means keeping a copy of the same data on multiple machines connected via a network. These machines need not sit in the same data center; they can be spread across different parts of the world.

There are many benefits of having this redundancy that you will see later in this article.

The objective of this article is to provide some insight into the different ways data replication is done in a distributed system. As you will see, replication is not easy, and engineers have to make plenty of trade-offs along the way depending on the system's use case.

Why Is Replication Important?

There are many reasons to replicate data. Some of the important ones are:

To Increase Availability. Parts of your system can stop working at any time (hardware failure, network failure, natural hazards, etc.), and having replicas helps the system keep working and stay available.
To Reduce Latency. Replication keeps the data geographically close to users, so the time required for data to reach a user's machine is minimal.
To Increase Read Throughput. Replication scales out the number of machines that can serve read queries, which increases the read throughput of the system.

What’s So Difficult?

You may ask: well, it's just replication, what could be so difficult?

This is a good question. Copying data is not very difficult in itself; the difficulty comes from the nature of the data.

If the data were static and never changed, replication would be much simpler: copy it once to every machine and you are done. But in the real world, data keeps changing, and propagating every change across multiple servers requires a lot of thinking.

The following three are the most popular techniques/algorithms to do replication:

Single-Leader Replication
Multi-Leader Replication
Leaderless Replication

Each has its own pros and cons, and the choice depends on the system's use case.

Leaders and Followers

I defined replication as storing copies of the same data on multiple machines. But then the inevitable question arises – How do we ensure that all the data is copied successfully across all servers/replicas?

The answer is simple – every write must be processed by every replica; otherwise, the replicas will no longer contain the same data and will become useless.

Then comes the question of How.

How can we make all replicas process every write?

There are many ways to do this, but the most common solution that I know of is leader-based replication.

The process goes as follows:

One of the replicas is designated the leader. Every write request from the client is sent to the leader, which first writes the new data to its own local storage.
The other replicas are called followers. As soon as the leader finishes writing data to its local storage, it sends the change to all its followers as part of the replication log or change stream (a subject in itself :D). Each follower then takes this log and applies the changes in the same order in which the leader processed them.
Read queries from clients can be served by either the leader or the followers, but writes are only accepted by the leader. In other words, the followers are read-only for the outside world.
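
The three steps above can be sketched in a few lines of Python. This is only a toy, in-memory model and not how any particular database implements it: the class names `Replica` and `Leader`, the tuple-shaped log entries, and the direct method calls standing in for network messages are all assumptions made for illustration.

```python
class Replica:
    """A node holding a full copy of the data plus the log it has applied."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.log = []

    def apply(self, entry):
        # Each entry carries a sequence number so that followers apply
        # changes in exactly the order the leader processed them.
        seq, key, value = entry
        if seq != len(self.log):
            raise ValueError("log entries must be applied in leader order")
        self.log.append(entry)
        self.data[key] = value

    def read(self, key):
        # Reads can be served by any replica.
        return self.data.get(key)


class Leader(Replica):
    """The only replica that accepts writes from clients."""

    def __init__(self, name, followers):
        super().__init__(name)
        self.followers = followers

    def write(self, key, value):
        entry = (len(self.log), key, value)
        self.apply(entry)                  # 1. write to local storage first
        for follower in self.followers:    # 2. ship the log entry to followers
            follower.apply(entry)
        return entry


followers = [Replica("follower-1"), Replica("follower-2")]
leader = Leader("leader", followers)
leader.write("username", "alice")
leader.write("username", "bob")
print([f.read("username") for f in followers])
```

Because followers expose only `apply` and `read`, clients have no way to write to them directly, which is the "read-only for the outside world" property described above.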

***Reference: Martin Kleppmann’s Designing Data-Intensive Applications***

Another obvious question that comes after looking at this image is – Is this replication taking place synchronously or asynchronously?

Let’s tackle this question now.

Synchronous VS Asynchronous Replication

Let’s walk through the above image and see what happens next when the user updates the username.

The update request sent by the client reaches the leader replica. Upon receiving the request, the leader writes the data to its local storage and then forwards the data change to its followers. Eventually, the leader notifies the client that the update was successful.

Let’s map out the communication that is taking place. In the following example image (time flows from top to bottom), the replication to follower 1 is synchronous. Synchronous means that the leader waits for follower 1 to confirm that it has received the write before reporting success to the user and before committing, that is, making the write visible to other clients.

On the other hand, replication to follower 2 is asynchronous: the leader sends the message but doesn’t wait for a response from the follower.

**Leader based replicas with one sync and one async**

As you can see in the diagram, there is a substantial delay before follower 2 processes the message. Usually replication is very fast (less than a second), but there is no guarantee. A follower can fall behind the leader by several minutes, depending on network conditions, load on the server, and many other factors.
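
To make the difference concrete, here is a minimal sketch of a leader with one synchronous and one asynchronous follower. Everything in it is hypothetical: the `pending` queue stands in for messages still travelling over the network, and `deliver_async` simulates those messages finally arriving at follower 2.

```python
from collections import deque


class Follower:
    def __init__(self, name):
        self.name = name
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value
        return True  # acknowledgement sent back to the leader


class Leader:
    def __init__(self, sync_follower, async_follower):
        self.data = {}
        self.sync_follower = sync_follower
        self.async_follower = async_follower
        self.pending = deque()  # changes sent to the async follower, not yet applied

    def write(self, key, value):
        self.data[key] = value
        # Synchronous: wait for follower 1 to acknowledge before replying.
        acked = self.sync_follower.apply(key, value)
        # Asynchronous: queue the change for follower 2 and move on.
        self.pending.append((key, value))
        return "ok" if acked else "failed"

    def deliver_async(self):
        # Simulates the delayed messages reaching follower 2.
        while self.pending:
            self.async_follower.apply(*self.pending.popleft())


f1, f2 = Follower("follower-1"), Follower("follower-2")
leader = Leader(f1, f2)

status = leader.write("username", "alice")
seen_by_f2_before = f2.data.get("username")  # follower 2 is still lagging
leader.deliver_async()
seen_by_f2_after = f2.data.get("username")
print(status, seen_by_f2_before, seen_by_f2_after)
```

Between `write` returning and `deliver_async` running, a read served by follower 2 would still return the old value. That window is the replication lag discussed above.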

There are both advantages and disadvantages with the above setup.

Advantage

The benefit of having a synchronous replica is that we can be sure at least one follower has the most up-to-date data, since the leader does not confirm a write until that follower has it too. If the leader suddenly fails, every write that was acknowledged to a client is still available on the synchronous follower, so no acknowledged data is lost.

Disadvantage

The leader now has to wait for the follower to finish writing the data. If there is a network partition, or the follower crashes, and the synchronous follower does not respond, the write cannot be processed: the leader has to block all writes until the follower is available again. So it is crucial that the synchronous follower stays up and reachable whenever the leader is.
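
The blocking behaviour can be illustrated with a small sketch. It assumes a hypothetical `reachable` flag that simulates a network partition, and it raises an exception right away where a real leader would more likely wait or retry until a timeout expires.

```python
class FollowerUnavailable(Exception):
    """Raised when the synchronous follower does not acknowledge a write."""


class SyncFollower:
    def __init__(self):
        self.data = {}
        self.reachable = True  # set to False to simulate a network partition

    def apply(self, key, value):
        if not self.reachable:
            return False  # stands in for a request that timed out
        self.data[key] = value
        return True


class Leader:
    def __init__(self, follower):
        self.data = {}
        self.follower = follower

    def write(self, key, value):
        # The write is committed only after the synchronous follower acks it.
        if not self.follower.apply(key, value):
            raise FollowerUnavailable(f"write to {key!r} rejected: no ack")
        self.data[key] = value
        return "ok"


follower = SyncFollower()
leader = Leader(follower)
first = leader.write("username", "alice")

follower.reachable = False  # the partition begins
try:
    leader.write("username", "bob")
    second = "ok"
except FollowerUnavailable:
    second = "blocked"

print(first, second, leader.data["username"])
```

The leader itself is perfectly healthy in the second call, yet the write still fails. A single unreachable synchronous follower is enough to make the whole system unavailable for writes.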

Often, leader-based replication in distributed systems is configured to be completely asynchronous. In that case, if the leader fails, any write that has not yet been replicated to a follower is lost, even though the client was already told that its data had been persisted successfully.
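
The data-loss scenario is easy to reproduce in a toy model. In the sketch below, the `in_flight` queue, the `async_write` and `replicate` helpers, and the failover done by simply promoting the follower are all simplifying assumptions made for illustration.

```python
from collections import deque


class Node:
    def __init__(self, name):
        self.name = name
        self.log = []


leader = Node("leader")
follower = Node("follower")
in_flight = deque()  # changes the leader has sent but the follower has not applied


def async_write(value):
    # The leader acknowledges as soon as its own write is done.
    leader.log.append(value)
    in_flight.append(value)
    return "ok"


def replicate():
    # Simulates the in-flight changes reaching the follower.
    while in_flight:
        follower.log.append(in_flight.popleft())


ack_a = async_write("a")
replicate()               # "a" reaches the follower
ack_b = async_write("b")  # the client is told "ok" ...

# ... but the leader crashes before "b" is replicated.
in_flight.clear()
new_leader = follower     # failover: the follower is promoted

lost = [v for v in leader.log if v not in new_leader.log]
print(ack_b, lost)
```

The client received `"ok"` for `"b"`, but after the failover the new leader has no record of it. This is the durability that asynchronous replication gives up.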

Does it sound like a good option to weaken durability for availability?

Well, this is where the engineers have to make the trade-offs depending on the use case of the system.

Sometimes data durability is more important than availability, and then synchronous replication makes more sense. At other times availability and write latency matter more, and asynchronous replication makes more sense.

Conclusion

Data replication is an area of continuous research in distributed systems: how do we make systems available and consistent at the same time? The CAP theorem limits what is achievable when the network partitions, but we can still strive for the best. This was a small introductory article on replication. Many more challenges come up with replication, like setting up a new follower, handling outages, follower failure, leader failure, leader election, etc., but it’s very exciting to understand and implement these algorithms and tweak them to see different results 🙂

The major difference between a thing that might go wrong and a thing that cannot possibly go wrong is that when a thing that cannot possibly go wrong goes wrong it usually turns out to be impossible to get at or repair.
~ Douglas Adams

Let me know your thoughts on the same.