A New Approach To High Availability: Validator Pairs

The blockscape validator has existed since the birth of Cosmos Hub 3 and is run by a group of blockchain enthusiasts with both high availability and the highest level of security in mind. You can find out more about us at www.blockscape.network.

Where We Left Off

In our last article, we introduced Raftify, our second iteration of a high availability solution for Cosmos validators. For those out of the loop, here’s a brief introduction:

In a nutshell, Raftify implements the Raft leader election algorithm so that validator clusters can manage themselves, with signing responsibility assigned to the leader node. It is designed to be embedded directly into the validator software and has built-in protection against double-signing in arbitrary failure scenarios as well as during network partitions.

Raftify is currently in a good place and well on the way to its final release. With most of the features on our to-do list implemented and the remaining bugs being fixed, we're reviewing the last bits of code for version 0.2.0 and preparing some final internal tests before releasing it.

For Raftify's final 1.0 release, we're planning to implement the remaining functional and convenience features from our to-do list, carry out a security audit and conduct a penetration test to get Raftify ready for production.

Validator Pairs

Recently, we stumbled across an interesting podcast from Citizen Cosmos in which Sunny Aggarwal shared his design for a non-Raft-based high availability solution that requires only a pair of nodes and removes the need for any direct communication between them by using the blockchain itself as the communication line between the two nodes.

Having implemented two high availability solutions ourselves, we were intrigued and decided to give this idea a go.

How Does It Work?

Figure: Healthy pair of validators

The basic principle of the aforementioned design boils down to one validator node doing all the signing work while a backup node closely monitors the signer and jumps in if the signer ever fails to do its job.

In practice, both validator nodes track the last few blocks and check whether the validator's own signature is contained in any of them. Taking a range of ten blocks as an example, the backup node will not jump in as long as the validator's signature is contained in at least one of the last ten blocks.
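The health check described above can be sketched in Go. This is a minimal illustration under stated assumptions, not blockscape's implementation: it assumes the node already knows, for every committed block, whether the validator's own signature was included (how that is read from the chain is left out), and the names `signedRecently` and `missWindow` are hypothetical. The window of ten blocks is the example value from the text.

```go
package main

import "fmt"

// missWindow is the number of recent blocks inspected. Ten matches the
// example in the text; it is an illustrative value, not a recommendation.
const missWindow = 10

// signedRecently reports whether the validator's own signature appears in
// at least one of the last `window` blocks. history[i] is true if the
// validator's signature was included in block i (oldest first).
// Hypothetical helper, not part of any Cosmos or Tendermint API.
func signedRecently(history []bool, window int) bool {
	if len(history) < window {
		// Assumption of this sketch: with fewer blocks than the window,
		// treat the pair as healthy to avoid a premature takeover.
		return true
	}
	for _, signed := range history[len(history)-window:] {
		if signed {
			return true
		}
	}
	return false
}

func main() {
	// The validator signed five blocks, then missed the next ten.
	history := []bool{true, true, true, true, true,
		false, false, false, false, false,
		false, false, false, false, false}

	fmt.Println(signedRecently(history[:10], missWindow)) // true: still healthy
	fmt.Println(signedRecently(history, missWindow))      // false: backup should jump in
}
```

Treating a history shorter than the window as healthy is a choice made for this sketch, so that a freshly started backup node does not take over before it has observed enough blocks.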

Figure: Unhealthy pair of validators

Should the backup node ever notice that the validator's signature is missing from all of the last ten blocks, the pair has fallen into an unhealthy state in which no blocks have been signed for an extended period of time.

Figure: State swap

Once the threshold of missed block signatures is reached, the backup node switches into the signer state and starts signing from the next block onward. As soon as the failed previous signer has resynchronized its local copy of the blockchain, it will also notice ten or more blocks without its signature, conclude that the backup node has jumped in, and switch into the backup state itself.
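The state swap can be modelled as a small state machine. The Go sketch below is hypothetical (`PairNode`, `OnBlock` and `Role` are illustrative names, not part of any Cosmos or Tendermint API) and assumes each node processes committed blocks one at a time, both during normal operation and while re-syncing. Counting consecutive blocks without the validator's signature is equivalent to checking that none of the last `Threshold` blocks contains it. The sketch deliberately ignores timing and double-signing edge cases.

```go
package main

import "fmt"

// Role is the state a node of the validator pair is in.
type Role int

const (
	Backup Role = iota
	Signer
)

func (r Role) String() string {
	if r == Signer {
		return "signer"
	}
	return "backup"
}

// PairNode is a hypothetical model of one node in a validator pair.
type PairNode struct {
	Role      Role
	Threshold int // consecutive blocks without the signature that trigger a swap
	misses    int // current run of blocks without the validator's signature
}

// OnBlock processes one committed block and returns the node's role for
// the following block. signed reports whether the validator's signature
// is included in the block.
func (n *PairNode) OnBlock(signed bool) Role {
	if signed {
		n.misses = 0
		return n.Role
	}
	n.misses++
	if n.misses >= n.Threshold {
		// A backup takes over signing from the next block on; a signer that
		// sees the same gap (e.g. after re-syncing) assumes the backup has
		// jumped in and steps down.
		if n.Role == Backup {
			n.Role = Signer
		} else {
			n.Role = Backup
		}
		n.misses = 0
	}
	return n.Role
}

func main() {
	a := &PairNode{Role: Signer, Threshold: 10} // original signer, goes offline
	b := &PairNode{Role: Backup, Threshold: 10} // backup watching the chain

	// Ten blocks are committed without the validator's signature.
	for i := 0; i < 10; i++ {
		b.OnBlock(false)
	}
	fmt.Println("b:", b.Role) // b: signer

	// b signs the next five blocks.
	for i := 0; i < 5; i++ {
		b.OnBlock(true)
	}

	// a comes back and re-syncs the same fifteen blocks.
	for i := 0; i < 10; i++ {
		a.OnBlock(false)
	}
	for i := 0; i < 5; i++ {
		a.OnBlock(true)
	}
	fmt.Println("a:", a.Role) // a: backup
}
```

In this model the returning signer relies only on having observed the gap of missed blocks, exactly as described above, so neither node ever needs to contact the other directly.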

We find this approach highly interesting and have started working on our own implementation. As soon as we've figured out all the details, we're going to follow up with another article about it. Stay tuned!

Validator operator in 15+ PoS blockchains. Visit us at www.blockscape.network.