ETH: Why Client Diversity Matters



With the launch of the Medalla testnet, the official team has encouraged people to experiment with different clients. The importance of doing so was clear from the moment of genesis: Nimbus and Lodestar nodes stalled, unable to handle the testnet load. As a result, Medalla failed to finalize for the first half hour after going live.

On August 15 (Beijing time), the clock servers that the Prysm client used as a reference suddenly drifted, pushing the clocks of Prysm nodes 4 hours ahead. These nodes therefore began producing blocks and attestations for future slots. When their clocks returned to normal, validators who had disabled the default slashing protection found themselves slashed.

For more details, I strongly recommend reading Raul Jordan's "ETH2 Medalla Testnet Incident" article (editor's note: see the hyperlink at the end of the article).

When the clock skew occurred, Prysm nodes accounted for about 62% of all nodes on the network. With those nodes out of step, the network could not reach the minimum participation rate (> 2/3) required to finalize blocks. To make matters worse, these nodes could not find the chain head they expected (there was a "gap" of up to 4 hours in the history, and the clocks of the Prysm nodes did not all agree with one another), so they tried to guess the "missing" data, creating many short forks and congesting the network.
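The arithmetic behind the stalled finality can be checked with a short Python sketch. It assumes, for illustration only, that a client's share of nodes equals its share of validator stake and that every validator on the failing client stops contributing useful attestations. The function names are hypothetical and do not come from any client or from the specification.

```python
from fractions import Fraction

# Finality requires MORE than 2/3 of validators to participate.
FINALITY_THRESHOLD = Fraction(2, 3)


def participation_after_failure(client_share: Fraction) -> Fraction:
    """Best-case participation if every validator on one client fails.

    Assumes (hypothetically) that all other validators stay online.
    """
    return 1 - client_share


def can_finalize(participation: Fraction) -> bool:
    """True only if participation is strictly greater than 2/3."""
    return participation > FINALITY_THRESHOLD


# Prysm's approximate share at the time of the incident, per the text.
prysm_share = Fraction(62, 100)
remaining = participation_after_failure(prysm_share)

print(f"Remaining participation: {float(remaining):.0%}")
print(f"Can finalize: {can_finalize(remaining)}")
```

With a 62% share out of action, at most 38% of the network is left, far below the 2/3 needed. Under the same assumptions, any client holding 1/3 or more of the validators can halt finality on its own if it fails.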

Currently, Prysm nodes account for 82% of all nodes on the Medalla testnet!

At this point, the network was flooded with thousands of guesses about the chain head, and the number kept growing. All clients began to be overwhelmed trying to work out which fork was correct. This led to nodes stalling, falling out of sync, and running out of memory, which made the situation worse.

This may well prove to be a blessing in disguise. Thanks to the incident, we were able not only to fix the underlying clock issue, but also to stress test the clients under mass node failure and heavy network load. Nevertheless, the incident need not have had such extreme consequences. The root cause is that Prysm's share of nodes was too large.

As I discussed earlier, 1/3 is the safety threshold for asynchronous Byzantine fault-tolerant algorithms. If more than 1/3 of validators are offline, the network cannot achieve finality. The ETH 2.0 chain keeps growing, but validators have no guarantee that any given block or state will not later be reverted.

Fundamentally, we want the economic incentives to be strong enough for validators to do what is good for the network without us having to trust them to be good people.

If more than 1/3 of validators are offline, the penalties for offline validators increase. This is known as the inactivity penalty.

In other words, as a validator, if something forces you offline, you want it to be something that does not take many other nodes offline at the same time.

The same goes for slashing. Your validators could be slashed because of a bug in the specification or the software, but an isolated slashing costs only 1 ETH.

However, if many validators are slashed at the same time as you (reaching the 1/3 safety threshold), the penalty can be as high as 32 ETH. (For more details, see this article.)
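To make the scaling concrete, here is a deliberately simplified Python model of a correlated slashing penalty. It is not the formula from the ETH 2.0 specification. The constants (a 1 ETH floor, a 32 ETH balance, and a multiplier of 3 so that the penalty reaches the full balance when 1/3 of validators are slashed) are assumptions chosen only to match the figures quoted above, and `slashing_penalty` is a hypothetical name.

```python
# Illustrative constants, chosen to match the figures in the text.
MIN_PENALTY_ETH = 1.0          # cost of an isolated slashing
FULL_BALANCE_ETH = 32.0        # balance of a single validator
CORRELATION_MULTIPLIER = 3.0   # assumption: full loss once 1/3 are slashed


def slashing_penalty(slashed_fraction: float,
                     balance_eth: float = FULL_BALANCE_ETH) -> float:
    """Simplified penalty for one slashed validator.

    slashed_fraction is the share of all validators slashed in the
    same period, between 0.0 and 1.0.
    """
    if not 0.0 <= slashed_fraction <= 1.0:
        raise ValueError("slashed_fraction must be between 0 and 1")
    # The penalty grows linearly with the slashed share, capped at the
    # validator's whole balance.
    correlated = balance_eth * min(CORRELATION_MULTIPLIER * slashed_fraction, 1.0)
    # An isolated slashing still costs the minimum penalty.
    return max(MIN_PENALTY_ETH, correlated)


for share in (0.0, 0.1, 1 / 3):
    print(f"{share:.0%} slashed together: {slashing_penalty(share):.1f} ETH")
```

In this model, a validator slashed alone loses the 1 ETH minimum, while the same offence committed alongside a third of the network costs the entire balance. That gap is what makes it expensive to fail in the same way as everyone else.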

The two situations above are called liveness anti-correlation and safety anti-correlation respectively, and both are deliberately designed into ETH 2.0. The anti-correlation mechanism ties each validator's penalty to that validator's impact on the network, thereby incentivizing validators to make the decisions that benefit the network most.

ETH 2.0 is being implemented by multiple independent teams. Each team develops its own client based on the specifications written by the ETH 2.0 research team. This ensures that there are multiple implementations of the beacon chain node and the validator client. When building an ETH 2.0 client, each team makes different decisions about technologies, languages, optimizations, and trade-offs. This way, even if a vulnerability exists in some layer of the ETH 2.0 system, it affects only the nodes running one particular client rather than the whole network.

Take the Prysm clock skew on the Medalla testnet as an example. If only 20% of ETH 2.0 nodes had been running the Prysm client and 85% of validators had been online, the inactivity penalty would never have been triggered for Prysm nodes. The development team could have fixed the problem with a few late nights, and penalties would have been kept to a minimum.

In reality, between 3,500 and 5,000 validators were slashed in a short period of time because too many validators were concentrated on the same client (and many of them had slashing protection disabled). Such a high correlation means that these validators lost about 16 ETH just because they were running a popular client.

At the time of writing, the total amount of penalties and slashings is still rising significantly, and final figures are not yet available.

Now is the time to try out different clients. Consider giving a minority client a try. (Click here to see the distribution of validators.) Currently, Lighthouse, Teku, Nimbus, and Prysm are all relatively stable, and Lodestar is catching up.

Most importantly, be sure to try a client that is new to you! If we even out the distribution of validators across clients on Medalla, we will be ready for the launch of the ETH 2.0 mainnet.

Original link:

Author: Carl Beekhuizen

Translation & proofreading: Min Min & A Jian



