As you may already know, on March 20th, an unfortunate misconfiguration of our active/active validator setup caused a double-signing incident, which led to our validator being slashed and tombstoned. In this article, we’d like to go into a bit more detail about what exactly we did to resolve this issue and how we will move forward.
The orange box at the top represents our decentralized key/value storage, which is driven by the Raft consensus algorithm. This Raft component holds an entry for every message that one of our validators has already signed. Here, we can see that three blocks have been signed, namely #1, #2 and #3. We will get to why #1 and #2 are crossed out shortly.
Our two validators 1 and 2 constantly compete for the permission to sign the next message and broadcast it to the rest of the blockchain network. For example, our red validator 1 signed block #1 whereas our green validator 2 signed blocks #2 and #3. In short, as long as they take turns signing messages, everything is fine. Problems start to arise if both try to sign the same message, though.
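The permission check described above can be sketched as follows. This is only a minimal illustration, not our production code: `PermissionLog`, `request_permission` and the validator names are hypothetical, and a plain in-memory dictionary stands in for the Raft-backed key/value storage.

```python
class PermissionLog:
    """In-memory stand-in for the Raft-backed permission log (hypothetical)."""

    def __init__(self):
        # Maps block height -> name of the validator that was allowed to sign it.
        self._signed = {}

    def request_permission(self, validator, height):
        """Grant permission only if no validator has signed this height yet."""
        if height in self._signed:
            return False  # someone already signed this block: deny
        self._signed[height] = validator
        return True


log = PermissionLog()
granted_first = log.request_permission("validator-1", 1)   # first request wins
granted_second = log.request_permission("validator-2", 1)  # same height: denied
```

In a real deployment the check and the write would have to happen as one atomic operation in the replicated store. The sketch glosses over that because it runs in a single process.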
One thing we observed during troubleshooting was that one of our validators was not at the same height as the other one. It was lagging a few blocks behind, as shown in a very simplified manner in the illustration above. Additionally, it could not reach any of the other validators in the network and thus was not able to synchronize its local blockchain with the global blockchain state. It had basically gone blind. This by itself is not a big problem, because the lagging validator will be denied permission to sign any block above its own highest block as long as, and this is the key factor, the entry for that block is still persisted in the permission log.
This is where our Raft Housekeeper comes in. In order to avoid the fuss associated with coordinating which validator deletes what and at what point, we made a simple external application that clears the permission log up to a specified block height at regular intervals. This makes sure that we don’t eventually run out of storage space and that both request and lookup times remain as low as possible. The entries that have been deleted by the housekeeper are the crossed-out boxes in the illustration above. So now only entry #3 is left, while validator 1 still thinks that block #2 hasn’t been signed yet. In other words, validator 1 had no way of knowing that block #2 had already been signed by validator 2, neither through the permission log nor by asking other validators in the network, because it suffered a network partition.
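To make the failure mode concrete, here is a small, simplified replay of the situation in the illustration. The function names and the dictionary-based log are assumptions made for this sketch and do not reflect our actual housekeeper code.

```python
def request_permission(log, validator, height):
    """Grant permission only if the height has no entry in the log."""
    if height in log:
        return False
    log[height] = validator
    return True


def prune_up_to(log, height):
    """Housekeeper step: delete every entry at or below `height`."""
    for h in [h for h in log if h <= height]:
        del log[h]


# State from the illustration: validator 1 signed #1, validator 2 signed #2 and #3.
signed = {1: "validator-1", 2: "validator-2", 3: "validator-2"}

# Before pruning, the lagging validator 1 is correctly denied block #2.
# (A copy is used so this probe does not modify the real log.)
denied_before_prune = not request_permission(dict(signed), "validator-1", 2)

# The housekeeper clears entries #1 and #2 ...
prune_up_to(signed, 2)

# ... and now nothing stops validator 1 from signing #2 a second time.
granted_after_prune = request_permission(signed, "validator-1", 2)
```

The second request succeeds only because the evidence of the first signature is gone, which is exactly the gap that a lagging, partitioned validator can fall into.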
Preventing our validators from lagging behind a couple more blocks than usual is something we, unfortunately, have no control over. What we can do, however, is ensure that the information necessary to prevent this from happening is there when it is needed. Here’s what we did:
- We set up additional monitoring regarding differences in block height between our validators. If our validators diverge too much, our system is going to alert us immediately.
- We configured our Raft Housekeeper to keep a minimum of 7200 entries in the permission log at all times. For an average block time of six seconds, this gives us at least 12 hours to react to incidents like this. This definitely gives us enough leeway in the long run.
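Both measures can be sketched in a few lines. The retention size and block time are the values mentioned above, while the function names and the `max_lag` alert threshold are hypothetical placeholders chosen for this example.

```python
MIN_ENTRIES = 7200       # minimum number of log entries the housekeeper keeps
AVG_BLOCK_TIME_S = 6     # average block time in seconds


def reaction_window_hours(min_entries=MIN_ENTRIES, block_time_s=AVG_BLOCK_TIME_S):
    """Time span covered by the retained entries, in hours."""
    return min_entries * block_time_s / 3600


def prune_keeping_minimum(log, min_entries=MIN_ENTRIES):
    """Delete the oldest entries but always keep the newest `min_entries`."""
    heights = sorted(log)
    excess = max(len(heights) - min_entries, 0)
    for h in heights[:excess]:
        del log[h]


def heights_diverged(height_a, height_b, max_lag):
    """Return True if two validators are further apart than `max_lag` blocks."""
    return abs(height_a - height_b) > max_lag


window = reaction_window_hours()  # 7200 * 6 s = 43200 s = 12 hours

# A log with 10,000 entries is trimmed down to the newest 7200.
log = {height: "validator" for height in range(1, 10001)}
prune_keeping_minimum(log)
```

With these numbers, a validator would have to lag more than 7200 blocks behind before the entry it needs disappears from the log, and the height check is meant to alert us long before that point.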
Our New Validator
Now that we’ve resolved this issue, we are thrilled to get back into the action with our new blockscape validator, which is already up and running. You can delegate to our new address now:
We hope this article gives you some more insight into the technical details behind this incident as well as into our general thought process. We will continue to improve ourselves in the future and stay loyal to our full transparency policy with our delegators.
Thank you for trusting blockscape, see you in the next validator update!