A Guide to High Availability for Cosmos Validators

blockscapeLab
7 min read · Dec 18, 2019

The blockscape validator has been running since the launch of Cosmos Hub 3 and is operated by a group of blockchain enthusiasts with both high availability and the highest level of security in mind. You can find out more about us at www.blockscape.network.

Guide Assumptions

This guide is designed for validator operators who want to make their own validator highly available. It assumes you have a basic understanding of Tendermint and general Proof of Stake related terminology.

By following along with this guide, you’ll get a good understanding of why high availability is necessary to operate a validator responsibly and how it can be achieved.

Why High Availability?

Cosmos is based on the premise that only a limited set of network participants, determined by their respective voting power, are able to actively participate in consensus. Those who wish to partake without having to run their own validator can do so by delegating their funds to other validators in said set. Both the validator and the delegator benefit from this relationship — in exchange for staking on a validator, and thus increasing its voting power, the delegator is rewarded proportionately.

This, however, requires delegators to trust validators to play by the rules. Tendermint employs a slashing mechanism that punishes any behavior that puts consensus at risk. Aside from obvious fraud such as crafting malicious transactions, this also includes unintentional behavior that you as an operator have no control over, such as unexpected downtime due to random power outages or the loss of network connection. Should one of these things happen to you, you will be unable to participate in consensus and collect rewards for yourself and your delegators until you manually resolve the situation, which both creates unnecessary busywork and costs you and your delegators money. As a consequence, you are likely to lose existing delegators and scare off future ones.

It is therefore in your interest to avoid unexpected downtime in order to maintain a good reputation among delegators and create incentives for more people to delegate their funds to you. As of now, there is no out-of-the-box solution for this problem, which is why validators are stuck creating their own high availability strategy. Thus, in order to help others solve this problem and, by extension, make the network more secure, we would like to show you how high availability can be achieved in the Cosmos Network.

Goals

Before we get started, let’s be clear about what a highly available validator actually entails and what we are trying to achieve here.

  1. In case our validator suffers a power outage, loses its network connection or crashes due to a hardware failure, we want it to stay online and not get slashed.
  2. We want the validator to handle most such scenarios on its own in order to minimize manual intervention as much as possible.

Architecture

The illustration below gives a quick overview of the core components needed to make a highly available validator possible. If you don’t know what some of the words mean, don’t worry. We are going to cover everything step by step.

High Availability Strategy Overview

Alright, we’re all set. Let’s break it all down, shall we?

Redundancy

In order to achieve high availability, we need some form of redundancy. The idea is to have a bunch of validators backing each other up in case one or more of them fail. These validators will run in parallel, sharing no data or resources with each other. Each one will be doing its own thing, which includes keeping its own local copy of the blockchain up to date using only its own resources. This is also known as an active/active cluster with a shared-nothing architecture, which avoids all the fuss associated with coordinating shared state among cluster members.

We also don’t want unexpected downtime to affect all validators simultaneously, as that would render our redundancy useless. That’s why each validator will be put into a separate availability zone in the cloud, a geographically separate area with its own power supply and network connection.

Clustered Entity

So right now we have multiple validators, each with its own key pair, and thus each representing a separate entity with its own wallet. With this setup, however, if a validator becomes unavailable, the entity it represents goes down with it and eventually gets slashed for downtime. So, what we need to do is ensure that all of our physical validators represent one single entity.

So, how do we do this? We simply provide all validators with the same key pair. This means that all of the rewards collected from minting and validating blocks will accumulate in one single wallet and all votes/proposals will be signed with the same private key. This way, even if one validator fails, the entity as a whole stays intact because it is no longer represented by just a single physical machine.
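To make this concrete, here is a minimal sketch of how the shared key can be verified, assuming the file-based signer that ships with Tendermint and illustrative file paths (adjust them to your node’s home directory). Every machine in the cluster gets a copy of the same priv_validator_key.json and should report the same validator address:

    package main

    import (
        "fmt"

        "github.com/tendermint/tendermint/privval"
    )

    func main() {
        // All cluster members share a copy of the SAME priv_validator_key.json,
        // while the signing state file stays local to each machine.
        // (paths are illustrative, not prescriptive)
        pv := privval.LoadFilePV(
            "/home/gaia/.gaiad/config/priv_validator_key.json",
            "/home/gaia/.gaiad/data/priv_validator_state.json",
        )

        // Each validator in the cluster should print the same address here,
        // meaning they all act as one and the same entity on-chain.
        fmt.Printf("validator address: %v\n", pv.GetAddress())
    }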

Coordination of Signing Permissions

If you’re somewhat familiar with Tendermint’s slashing mechanism, that last paragraph should have raised a big red flag, and rightfully so. Letting multiple validators use the same private key simultaneously is a surefire way to get slashed heavily and eventually jailed for double-signing.

In order to solve this problem, our cluster members need to coordinate signing permissions for each message before it is propagated into the network. We need some sort of superordinate middleware layer that can be used for reaching consensus inside of our cluster on which validator gets permission to sign the next message.

Coordination of Signing Permissions in an Active/Active Cluster

This can be done by having our middleware layer keep a record of already signed messages. A validator will be able to claim permission to sign a message if no entry for that message already exists. The rest of the cluster members will also try to claim the permission, but fail due to there already being an entry for it.

To sum it up, you can think of it as a cluster-wide competition for signing permissions. The first validator to create an entry for a specific message claims the right to sign that particular message. The others will see this entry, know that the message has already been signed, and wait for the next one to claim permission for. So, as long as at least one validator keeps claiming permissions, the entity as a whole stays online.
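As an illustration, here is a minimal sketch of such a claim against Consul’s key/value store via its official Go client (the key layout and node ID are our own assumptions, not a fixed scheme). A check-and-set (CAS) write with index 0 only succeeds if the key does not exist yet, which gives us exactly the first-one-wins semantics described above:

    package main

    import (
        "fmt"
        "log"

        "github.com/hashicorp/consul/api"
    )

    // claimSigningPermission tries to register an entry for the given message
    // (identified here by height/round/step) in Consul's KV store. It returns
    // true if this validator won the claim and may sign, false otherwise.
    func claimSigningPermission(kv *api.KV, nodeID string, height int64, round int, step int8) (bool, error) {
        pair := &api.KVPair{
            // Hypothetical key layout: one entry per signable message.
            Key:   fmt.Sprintf("signing/%d/%d/%d", height, round, step),
            Value: []byte(nodeID),
            // ModifyIndex 0 turns the put into "create only if absent",
            // so exactly one cluster member can win this race.
            ModifyIndex: 0,
        }
        won, _, err := kv.CAS(pair, nil)
        return won, err
    }

    func main() {
        client, err := api.NewClient(api.DefaultConfig()) // talks to the local Consul agent
        if err != nil {
            log.Fatal(err)
        }

        won, err := claimSigningPermission(client.KV(), "validator-1", 100, 0, 2)
        if err != nil {
            // If Consul is unreachable we must NOT sign; missing a block
            // is far cheaper than risking a double-sign.
            log.Fatal(err)
        }
        if won {
            fmt.Println("claim won: sign and broadcast the message")
        } else {
            fmt.Println("claim lost: another validator already signed this message")
        }
    }

Note the failure mode baked into this sketch: whenever the middleware cannot be reached, the validator refrains from signing altogether, because a missed block costs far less than a double-sign.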

Awesome! With this, we’ve achieved our first goal!

As you might have already noticed, the middleware is an integral part of the system, so it is of utmost importance to make it highly available, too. This is where the consensus algorithm Raft comes into play. It allows a server cluster to manage itself by employing a dynamic leader-follower system which can tolerate ⌊(n-1)/2⌋ failing nodes in a cluster of size n. It essentially keeps a key/value store of the signing permissions and replicates it across the other Raft nodes in the cluster. As with the validators, these Raft nodes should be geographically distributed and have both their own power supply and network connection.

For our Raft implementation, we opted for HashiCorp’s Consul, as it also comes with a pretty neat REST API that the validators can talk to in order to commit entries to the log used to coordinate signing permissions. You can check out their documentation on how to set up a cluster. For the sake of simplicity, we went with a three-node Raft cluster, which can tolerate one node failure. This should be considered a minimal setup. We wouldn’t recommend going any higher than five nodes (which tolerates two node failures): the higher the number of nodes, the more replication needs to be done, which leads to increased latency.
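To check that the Raft cluster is actually healthy, you can query Consul’s status endpoints. Below is a hedged sketch using the same Go client; it verifies that a leader has been elected and reports how many node failures the current cluster size can still tolerate:

    package main

    import (
        "fmt"
        "log"

        "github.com/hashicorp/consul/api"
    )

    func main() {
        client, err := api.NewClient(api.DefaultConfig())
        if err != nil {
            log.Fatal(err)
        }

        // A non-empty leader address means the Raft cluster has quorum.
        leader, err := client.Status().Leader()
        if err != nil || leader == "" {
            log.Fatal("no Raft leader elected: signing coordination is down")
        }

        // Peers() lists all Raft servers; a cluster of n nodes tolerates
        // floor((n-1)/2) failures, e.g. 3 nodes -> 1, 5 nodes -> 2.
        peers, err := client.Status().Peers()
        if err != nil {
            log.Fatal(err)
        }
        n := len(peers)
        fmt.Printf("leader: %s, servers: %d, tolerable failures: %d\n", leader, n, (n-1)/2)
    }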

Further Considerations

A pleasant side effect of this architecture is that it removes the need for sentry nodes, since there is no longer a single target that needs to be protected from DDoS attacks. Instead, we multiply that target so the validators protect each other: even if one of them is under attack, another one from the active/active cluster can still keep signing messages.

Of course, you can always add more validators to your cluster without impacting the performance of the system. In the context of DDoS protection, you just need to make sure that at least one validator is up and running at all times.

Caveats

  • This article does not cover private key security. Please consider using a hardware security module like the YubiHSM 2. To make it work with the validator software, Certus One’s Aiakos is a great starting point.
  • Keep in mind that you need physical access to your validator machines if you’re using an HSM. If you can’t run your validators in multiple geographically distributed locations, you can use an uninterruptible power supply for at least one of your machines to prevent the whole cluster from being affected by a power outage at once. The risk of losing network connection, however, remains. An alternative approach would be to create a local, highly available cluster of HSMs that the validators can talk to.
  • Please be aware that the additional communication layer provided by the middleware adds latency to the signing process, depending on the geographical location and your relative distance to the cluster nodes. This setup has only been tested with the default timeout settings, so please test your custom setup thoroughly in the latest Testnet before you go live on the Mainnet.
  • Monitoring is also not covered in this article. It should not be neglected, though, since some failure cases still require manual intervention, namely the failure of the majority of Raft nodes or the failure of all validators in the active/active cluster. A minimal liveness-check sketch follows this list.
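As a starting point for such monitoring, the sketch below polls each validator’s Tendermint RPC status endpoint as well as Consul’s leader endpoint and flags the two scenarios above (the validator addresses are placeholders, and the polling interval is an arbitrary choice):

    package main

    import (
        "fmt"
        "net/http"
        "time"

        "github.com/hashicorp/consul/api"
    )

    func main() {
        // Placeholder RPC addresses of the validators in the active/active cluster.
        validators := []string{
            "http://validator-1:26657/status",
            "http://validator-2:26657/status",
            "http://validator-3:26657/status",
        }

        consul, err := api.NewClient(api.DefaultConfig())
        if err != nil {
            panic(err)
        }
        httpClient := &http.Client{Timeout: 5 * time.Second}

        for {
            // Case 1: all validators in the active/active cluster are down.
            alive := 0
            for _, url := range validators {
                resp, err := httpClient.Get(url)
                if err != nil {
                    continue
                }
                if resp.StatusCode == http.StatusOK {
                    alive++
                }
                resp.Body.Close()
            }
            if alive == 0 {
                fmt.Println("ALERT: no validator reachable, manual intervention required")
            }

            // Case 2: the Raft cluster has lost its majority (no leader elected).
            if leader, err := consul.Status().Leader(); err != nil || leader == "" {
                fmt.Println("ALERT: Consul has no leader, signing coordination is down")
            }

            time.Sleep(30 * time.Second)
        }
    }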

Conclusion

And there you go! A highly available cluster of validators that withstands unexpected downtime and does not double-sign. As long as at least one validator in the active/active cluster and the majority of nodes in the Raft cluster are up and running, the validator entity as a whole remains intact.

We hope some of you out there found this guide useful and that you had as much fun reading as we had writing it! If you have any questions, feel free to contact us!
