High Availability for Cosmos Validators: Meet Raftify

blockscapeLab
May 19, 2020

The blockscape validator has existed since the birth of Cosmos Hub 3 and is run by a group of blockchain enthusiasts with both high availability and the highest level of security in mind. You can find out more about us at www.blockscape.network.

Where We Left Off

In our last article, we established the theoretical foundation for a highly available validator in an active/active setup with a Raft consensus middleware layer. This approach provides a couple of key advantages that make it a reasonably good fit for use in production.

  1. Running multiple validators in parallel, backing each other up in failure scenarios, means there is no more single point of failure.
  2. Setting up an active/active cluster is quite easy: out-of-the-box software such as HashiCorp’s Consul can manage the permission log, leaving relatively little implementation work to be done.

For a start, this already sounds pretty good. There are, however, a couple of caveats.

  1. The process of claiming permissions over a middleware layer adds some latency to the signing process.
  2. A minimal setup already requires five nodes — three Raft nodes and two validators.
  3. Only the validator portion of the cluster scales well. Upscaling of the Raft portion comes with increased replication work and thus also adds further latency to the signing process.
    Luckily, there is no reason to go higher than a five-node Raft cluster. However, be aware that the validator portion relies on Raft being functional: if Raft fails, the validators will follow suit.

Now, the points above are by no means a deal breaker. However, there is always room for improvement. So, we decided to step things up a notch.

Meet Raftify

Raftify is an implementation of the Raft leader election algorithm and is designed to be directly embedded into an application, thus combining both the Raft and the validator layer into one single entity. It is meant to be an alternative to running an active/active validator cluster with a separate full-fledged Raft consensus layer.

Before we get into the nitty-gritty, though, let’s take a look at what makes Raftify stand out compared to our previous active/active approach:

Juxtaposition of an active/active setup with Raftify

Minimal Setup

A minimal active/active setup requires five nodes in total — three Raft nodes to ensure a highly available middleware layer for keeping the log of signing permissions, and two additional validator nodes backing each other up in case one of them should fail.

Raftify reduces the minimum number of cluster nodes from five down to three, each of which is a validator with an integrated Raft layer.

Failure Tolerance

Failure tolerance in an active/active setup is a double-edged sword. The cluster is basically divided into two logically separated units which, by nature, splits failure tolerance in two as well. Any sort of failure is limited to the same node type, so a validator failure will not affect the availability or tolerance of the Raft cluster and vice versa. What you do need to keep in mind, though, is that each portion strongly relies on the other being functional in order for the entity as a whole to stay intact. No amount of validators can compensate for a non-functioning Raft cluster, and no amount of Raft nodes can compensate for too few or no validators at all.

With Raftify, things become simpler. The merger of Raft with the validator creates one logical unit, and thus makes all nodes share the same node type. Failure tolerance is inherited from the Raft consensus algorithm, since it represents the lowest common denominator.

Scalability

Another area affected by the above-mentioned logical split is scalability. Although upscaling the Raft portion always comes with increased replication work and latency, there is, luckily, no need to go any higher than three or five nodes as long as they are geographically distributed and have their own internet connection and power supply. In contrast, upscaling the validator portion has no noticeable impact on the performance of the cluster, so validators can be added without any major drawbacks.

Scalability comparison (only validator nodes)

If you compare the scalability of both setups in relation to their failure tolerance it becomes clear that, especially from a cost perspective, Raftify is a better fit for small clusters while the benefits of an active/active cluster surface with higher cluster sizes (assuming a constant baseline of three Raft nodes with a failure tolerance of one node).

When it comes to running a validator, though, we believe that a cluster size of five nodes, which means being able to tolerate two node failures at once, is the most any operator is ever going to need, and Raftify perfectly meets this requirement while reducing the number of nodes that need to be run.

Disk Space

This is a rather small one, but worth mentioning nonetheless. Pure Raft nodes need very little storage space: it is used almost entirely by the operating system, a very small amount by the software itself, and even less by the permission log, as only the most recent log entry is relevant.

Validators, by comparison, require a relatively large amount of storage as they also need to persist their local copy of the blockchain. Since Raftify gets rid of pure Raft nodes, every Raftify node needs to be treated as a full-fledged validator node and must consequently be given enough storage space.

So, while Raftify requires fewer nodes to be run on the low end of the spectrum, it definitely needs more storage space than its active/active counterpart.

Log Replication

Validators in an active/active setup are not aware of each other’s presence as there is no direct communication between them. All communication goes through the Raft cluster, which needs a means of determining who to give signing permission to. This is ultimately why a permission log is needed, and why the latency of writing to and reading from that log is unavoidable in this setup.

Raftify, however, doesn’t need to persist any logs because it hooks into the signing process at the application level. The permission to sign a particular message only needs to be held in memory as long as it hasn’t been signed yet. After that, there is no need to keep that assignment around at all.
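To illustrate the difference, here is a minimal conceptual sketch in Go. It is not Raftify’s actual API — just an illustration of the point that the only “permission” a node needs is the fact that it is currently the leader, and that nothing about past permissions is ever persisted or replicated:

```go
package main

import "fmt"

// In the active/active model, the middleware has to persist who may sign what.
// In the Raftify model, the "permission" is simply the fact of currently being
// the leader, held in memory only.
type node struct {
	isLeader bool
}

// sign refuses to produce a signature unless the node is the current leader.
// No log entry is written anywhere.
func (n *node) sign(msg []byte) ([]byte, error) {
	if !n.isLeader {
		return nil, fmt.Errorf("not the leader, refusing to sign")
	}
	// ... hand the message to the actual signing backend here ...
	return []byte("signature over " + string(msg)), nil
}

func main() {
	leader, follower := &node{isLeader: true}, &node{isLeader: false}

	sig, err := leader.sign([]byte("block #42"))
	fmt.Println(string(sig), err) // signature over block #42 <nil>

	_, err = follower.sign([]byte("block #42"))
	fmt.Println(err) // not the leader, refusing to sign
}
```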

How Does It Work?

In short, Raftify is an implementation of Raft’s leader election algorithm with a few modifications to ensure leader-stickiness and a single leader during network partitions.

Raft in a Nutshell

Every node in a Raft cluster can be in one of three states: Leader, Follower or Candidate. Each state comes with its own responsibilities.

  • The leader is the sole managing node in the cluster. Its job is to send heartbeats to the rest of the cluster in order to signal availability.
  • A follower is the receiving end of the leader’s heartbeats. If it doesn’t receive any heartbeat for an extended period, it assumes the leader is no longer available and becomes a candidate.
  • A candidate tries to collect votes from the majority of cluster nodes in order to become the new leader and replace the old one.
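To make these roles a bit more concrete, here is a minimal Go sketch (our own simplification, not Raftify’s actual code) of a follower that resets its election timeout on every heartbeat and switches to the Candidate state once the timeout expires:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// The three classic Raft node states.
type State int

const (
	Follower State = iota
	Candidate
	Leader
)

// runFollower is a simplified follower loop: reset the election timeout on
// every heartbeat, become a candidate if no heartbeat arrives in time.
func runFollower(heartbeats <-chan struct{}) State {
	// Randomized timeout so that not all followers time out simultaneously.
	timeout := time.Duration(150+rand.Intn(150)) * time.Millisecond
	for {
		select {
		case <-heartbeats:
			// Leader is alive, keep waiting for the next heartbeat.
		case <-time.After(timeout):
			// No heartbeat for too long: assume the leader is gone.
			return Candidate
		}
	}
}

func main() {
	hb := make(chan struct{})
	go func() {
		// Simulate a leader that sends a few heartbeats and then dies.
		for i := 0; i < 3; i++ {
			hb <- struct{}{}
			time.Sleep(50 * time.Millisecond)
		}
	}()
	fmt.Println(runFollower(hb) == Candidate) // true: the follower timed out
}
```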

For a more in-depth explanation of the Raft consensus in terms of its inner workings and message model, please check out this link. From this point onward, we will assume you have a basic understanding of how Raft works.

Leader Stickiness

The Raft consensus holds up pretty well in an optimal scenario. There is, however, a real-world scenario, albeit a corner case, in which it doesn’t hold up so well. This case is described as “flip-flopping” and happens whenever two nodes in the cluster constantly compete for leadership. This can be caused by flaky networks where connectivity between two adjacent nodes (one of which is the current leader) breaks down multiple times for a few seconds.

Leader flip-flopping

In the example above, nodes B and C are constantly competing for leadership.

  • First, C votes for itself and receives a vote from A, so two out of three votes in total. It has therefore reached the quorum and becomes the new leader. Node B then learns through node A that there is a new leader and steps down (becomes a follower).
  • After that, B doesn’t receive any heartbeats from the supposed new leader and decides to start a new candidacy. It votes for itself and receives a vote from A, so with two of three votes it has also reached the quorum and becomes the leader. The cycle repeats.

This is not a failure of Raft per se — all of Raft’s requirements are still met. However, in practice, this is undesired behavior. For this exact reason, Raftify adds a fourth state to the existing three: the PreCandidate.

A PreCandidate is the immediate successor to the follower and makes sure that there really is no current leader by asking all the other members of the cluster to confirm its suspicion. If it hears back from a majority of other PreCandidates, which are essentially other nodes that also haven’t gotten a heartbeat in a while, it can be sure that there truly is no current leader, so it can go ahead and start a new candidacy.

So, if we apply this to our previous example with node B as our initial leader, node C now won’t be able to start a new candidacy because node A will always tell it that there already is a leader which doesn’t need to be replaced.
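The following Go sketch shows the idea behind this pre-vote round. It is a simplification for illustration purposes, not the actual Raftify implementation: a PreCandidate only moves on to a real candidacy if a majority of the cluster, counting itself, confirms that the leader appears to be gone.

```go
package main

import "fmt"

// askPeer stands in for the real network call that asks a single peer
// whether it also considers the leader dead.
type askPeer func(peer string) bool

// shouldStartCandidacy captures the pre-vote idea: the node counts its own
// suspicion plus every peer that also reports a missing leader, and only
// starts a real candidacy if that reaches a majority of the whole cluster.
func shouldStartCandidacy(peers []string, ask askPeer) bool {
	confirmations := 1 // the PreCandidate itself hasn't seen a heartbeat
	for _, p := range peers {
		if ask(p) {
			confirmations++
		}
	}
	clusterSize := len(peers) + 1
	return confirmations > clusterSize/2
}

func main() {
	// Example from above: node C cannot reach leader B, but node A still can.
	// A answers "no, the leader is fine"; B is unreachable (counts as "no").
	confirmed := map[string]bool{"A": false, "B": false}
	ask := func(peer string) bool { return confirmed[peer] }

	fmt.Println(shouldStartCandidacy([]string{"A", "B"}, ask)) // false: C stays put
}
```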

Network Partitions

Apart from fluctuations in network connectivity, network partitions also need to be taken into account to ensure that there is at most one leader at any given time.

Network Partition

The cluster above has been split into two sub-clusters during a network partition. A couple of observations:

  1. Neither node A nor node B will be able to reach any quorum as the two of them don’t make up the majority of nodes in a five-node cluster.
  2. The nodes C, D and E do make up the majority of the cluster and are therefore able to elect a new leader.

If you’ve paid close attention to the points above, you will have noticed a problem: if nodes C, D and E elect a new leader (which they absolutely will), there will be two leaders simultaneously — one in each sub-cluster. In a validator cluster, this would certainly lead to double-signing. This, of course, must never happen, so we need to take care of this.

So, what do we do here? It’s actually pretty simple. Up until now, non-leader nodes only had to reach a quorum of votes if they wanted to become the new leader. There was no need for the leader to keep track of anything since it was the other nodes’ job to tell it if anything had changed. Here, the other nodes have no means of telling the leader what’s going on, so the leader itself also needs to take action.

The solution to this problem is a leader quorum — we let the leader count the responses it got back from the heartbeats it sent out and check periodically if it has actually reached the majority of cluster nodes. If not, it has to assume it has been split out and a new leader will soon be elected.
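Sketched in Go, this check boils down to a simple majority count. Again, this is a simplified illustration rather than the actual Raftify code:

```go
package main

import "fmt"

// hasLeaderQuorum checks whether the leader, counting itself plus all peers
// that acknowledged its last round of heartbeats, still reaches a majority
// of the cluster. If not, the leader must assume it has been split out and
// should step down voluntarily.
func hasLeaderQuorum(clusterSize, acksReceived int) bool {
	reachable := acksReceived + 1 // the leader itself
	return reachable > clusterSize/2
}

func main() {
	// Five-node cluster, partitioned into {A, B} and {C, D, E}.
	// Leader A only gets an acknowledgement from B and steps down.
	fmt.Println(hasLeaderQuorum(5, 1)) // false

	// A newly elected leader among C, D and E reaches two peers
	// and therefore keeps its leadership.
	fmt.Println(hasLeaderQuorum(5, 2)) // true
}
```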

Leader election during a network partition

In the picture above, our previous leader node A voluntarily stepped down as it couldn’t reach the leader quorum. This all happens before the nodes of the other sub-cluster even realize they have to start a new election. As a consequence, there will be a very short time frame in which there is no leader at all (a couple of hundred milliseconds with optimal performance). To keep our promise of having only one single leader at all times, though, this is a sacrifice that is absolutely worth making.

So, for as long as the network partition persists, nodes A and B will stay in the PreCandidate state and try to reach the quorum needed to start a new candidacy, while we let nodes C, D and E elect a new leader with a clear conscience. Once the network partition is lifted, the two sub-clusters will merge back together and the current leader will finally be able to reach the split-out nodes A and B again, thus keeping the single-leader promise.

Limitations

  • A cluster size of n can tolerate up to floor((n-1)/2) node failures.
    Example: A cluster of 5 nodes tolerates floor((5-1)/2) = 2 node failures.
  • No more than floor((n-1)/2) nodes must ever fail at the same time. Once the failed nodes are kicked out of the cluster and the size shrinks, the failure tolerance adjusts to the new cluster size.

    Example 1: If in a cluster of 5 nodes 3 nodes fail in a short time frame (before the dead nodes are kicked out), the remaining 2 nodes will never be able to reach the quorum again in order to negotiate a new leader.

    Example 2: If in a cluster of 5 nodes 2 nodes fail in a short time frame, the remaining 3 nodes will still be able to reach the quorum in order to negotiate a new leader. The crashed nodes will eventually be kicked from the cluster, thus shrinking the cluster size to a total of 3 nodes and adjusting its failure tolerance to floor((3-1)/2) = 1 node.
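In code, the tolerance is a one-liner (Go’s integer division already floors the result for positive values):

```go
package main

import "fmt"

// failureTolerance returns how many simultaneous node failures a cluster
// of n nodes can survive: floor((n-1)/2).
func failureTolerance(n int) int {
	return (n - 1) / 2
}

func main() {
	fmt.Println(failureTolerance(5)) // 2
	fmt.Println(failureTolerance(3)) // 1: the shrunken cluster from example 2
}
```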

Conclusion

Running validators in an active/active cluster with a separate Raft layer for the permission log remains a viable option for a highly available construct, especially for bigger cluster sizes.

Raftify, however, is a better fit for smaller validator clusters. Let’s summarize:

  1. A minimal Raftify setup requires only three nodes compared to five for an active/active cluster.
  2. No log replication, thus reducing communication work between nodes and latency by extension.
  3. No leader flip-flopping and guaranteed single (or no) leader during network partitions.
  4. Dynamic failure tolerance. Dead nodes will be kicked by the cluster nodes shortly after they are declared dead, thus resizing the cluster and simultaneously adjusting failure tolerance according to the new cluster size — and all this while keeping the promise of one single leader at all times.

You’ve made it to the end of the article! Thank you for reading the whole thing; we hope you enjoyed it and learned something! If we sparked your interest in Raftify, you are welcome to check it out on GitHub.
