A New Approach To High Availability: Validator Pairs
The blockscape validator has existed since the birth of Cosmos Hub 3 and is run by a group of blockchain enthusiasts with both high availability and the highest level of security in mind. You can find out more about us at www.blockscape.network.
Where We Left Off
In our last article, we introduced Raftify, our second iteration of a high availability solution for Cosmos validators. For those out of the loop, here’s a brief introduction:
In a nutshell, Raftify implements the Raft leader election algorithm in order for validator clusters to manage themselves by assigning signing responsibility to the leader node. It is designed to be directly embedded into the validator software and has built-in protection against double-signing in arbitrary failure scenarios as well as during network partitions.
Raftify is currently in a good place and well on the way to its final release. With most of the features on our todo list implemented and the remaining bugs being fixed, we’re currently reviewing the last bits of code for version 0.2.0 and preparing some final internal tests before the release.
For Raftify’s final 1.0 release, we’re planning to implement the remaining set of functional and convenience features from our todo list, do a security audit and conduct a penetration test to get Raftify ready for production.
Recently, we stumbled across an interesting podcast from Citizen Cosmos in which Sunny Aggarwal shared his design for a non-Raft-based high availability solution. It only requires a pair of nodes to be run, and it eliminates all communication overhead within the system by using the blockchain itself as the communication line between the two nodes.
I think I have a solution that’s even simpler or better than using raft between the validators. You have this high communication overhead within your system before you can make nodes, A sign every single block. You also need to have at least three nodes in order to do raft. So here’s my solution. Let’s say I had two validator nodes. Let’s say I had a primary and a backup, right? I want the primary to basically always be signing, and if it fails, you want the secondary, the backup to take its position. This would be simple to do if you had a perfect communication link, like a perfectly synchronous communication line between your two validators. But then the problem is we don’t because what if something happened between that wire that connects you to validators. Here’s the thing, we actually do have a perfectly synchronous communication link between the two nodes and it’s the blockchain itself. So, what you could do is you can make a simple rule. In Tendermint and in the Cosmos SDK staking module we kind of say, you can miss hundreds of blocks without getting in trouble, right? We can just make a simple rule that says, look, the primary is signing blocks always and the secondary is watching the blockchain. If the primary, if it ever sees 10 Tendermint blocks in which our signature is not on, the primary signature is not on it. It will start signing and what the primary will say is if I ever see 10 Tendermint blocks in a row, in which my signature is not there, I will shut off and never turn on again. I will not sign after that. This guarantees that there’s no situation in which there’s any block in which both the primary and the secondary tried to sign it.
Having implemented two high availability solutions ourselves, we were intrigued and decided to give this idea a go.
How Does It Work?
The basic principle of the aforementioned design boils down to one validator node doing all the signing work while a backup node closely monitors the signer and jumps in if the signer should ever fail to do its job.
The way this works is by having the two validator nodes track the last few blocks and check whether the validator’s own signature is contained in any of them. So, if we take a range of ten blocks for example, the backup node will not jump in as long as the validator’s signature is contained in at least one of the last ten blocks.
Should the backup node ever notice that the signature is missing from all of the last ten blocks, the pair has fallen into an unhealthy state in which no blocks have been signed for an extended period of time.
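To make the check a little more concrete, here is a minimal sketch in Python. It is not taken from any existing implementation: the function name signed_recently, the representation of a block as a set of signer addresses, and the choice to treat a too-short history as healthy are all assumptions made for illustration. The window of ten blocks is taken from the example above.

```python
from typing import List, Set

WINDOW = 10  # assumed threshold, taken from the ten-block example


def signed_recently(recent_signers: List[Set[str]],
                    own_address: str,
                    window: int = WINDOW) -> bool:
    """Return True if own_address signed at least one of the last `window` blocks.

    `recent_signers` is a hypothetical stand-in for real block data: one set
    of signer addresses per block, ordered from oldest to newest.
    """
    last = recent_signers[-window:]
    # Assumption: with fewer than `window` blocks observed there is not
    # enough evidence of a failure, so the pair is reported as healthy.
    if len(last) < window:
        return True
    return any(own_address in signers for signers in last)
```

With this helper, the backup node would stay passive as long as signed_recently(...) returns True and would treat the pair as unhealthy once it returns False.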
Having reached the threshold of missed block signatures, the backup node switches into the signer state and starts signing from the next block on. As soon as the failed previous signer has synchronized its local blockchain, it will also notice ten or more blocks without its signature and assume that the backup node has jumped in, which tells it to switch into the backup state.
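The role switch described above can be sketched as a small state machine. This is only a simplified illustration under our own assumptions, not the final design: the names Role, PairNode and on_block are hypothetical, a block is reduced to the set of addresses that signed it, and resetting the counter after a switch is our assumption to keep a node from flipping again on the very next block.

```python
from enum import Enum
from typing import Set


class Role(Enum):
    SIGNER = "signer"
    BACKUP = "backup"


class PairNode:
    """One node of a validator pair; both nodes share the same validator address."""

    def __init__(self, role: Role, own_address: str, threshold: int = 10):
        self.role = role
        self.own_address = own_address
        self.threshold = threshold  # assumed: ten blocks, as in the example
        self.missed = 0             # consecutive blocks without our signature

    def on_block(self, signers: Set[str]) -> Role:
        """Process one block (oldest first) and return the resulting role."""
        if self.own_address in signers:
            self.missed = 0
            return self.role
        self.missed += 1
        if self.missed >= self.threshold:
            # Backup takes over signing; a signer that was offline and is
            # catching up concludes the backup jumped in and steps down.
            self.role = Role.BACKUP if self.role is Role.SIGNER else Role.SIGNER
            self.missed = 0  # assumption: start counting afresh after a switch
        return self.role
```

Both nodes would feed every block they observe into on_block, and only a node whose role is Role.SIGNER would sign the next block. Since both nodes derive their role from the same chain, they need no direct connection to each other.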
We find this approach highly interesting and have decided to start working on our own implementation of it. As soon as we’ve figured out all the details, we’re going to follow up with another article about our implementation. Stay tuned!