Summary

On March 13, 2024, the Blast Mainnet sequencer stopped producing new blocks from 14:05 to 15:10 UTC. The root cause was the node software’s incomplete support of Ethereum’s post-Dencun block header shape and the decision to strictly validate all L1 block headers.

Impact

During these 65 minutes, users were unable to execute new transactions, see their L1 deposits reflected in their L2 wallets, or initiate new withdrawals. Existing withdrawal requests that had passed the challenge period were unaffected and remained claimable on the L1.

Public RPCs and block explorers remained online, but the lack of new blocks led to a confusing user experience. Once the sequencer started producing blocks again, the primary RPC at https://rpc.blast.io and two other public RPCs (three of the five in total) automatically caught up with the new blocks, restoring normal behavior. However, the other two public RPCs remained stuck on block 764454, the block from 14:05 UTC. This led to some users seeing incorrect-nonce errors in their wallets until those RPCs were updated to the patched node software.

Background

The instigating event for this incident was Ethereum’s Dencun upgrade, which activated at 13:55 UTC. Dencun had been on our radar for a while, and we had prepared the Blast codebase for it on the Blast Sepolia Testnet. Dencun had activated on Sepolia more than a month earlier, and the Blast Testnet had been successfully handling the new block header shape for weeks.

Since the Blast Testnet adheres to a subtly different specification than Mainnet, there is a separate branch for each environment. This meant that changes to testnet needed to be ported over to the mainnet branch. The primary impact of these changes is to enable the node software to validate the consistency of the block headers it receives from the configured L1 provider. This validation step is optional and can be turned off if the L1 provider is trusted.

Failure to validate correct headers unnecessarily disrupts block production, which is what happened in this incident. Disabling validation would allow block production to continue, but doing so could lead to dramatically worse consequences than just downtime if we were to ever receive incorrect L1 block data. Blast Mainnet runs with this validation step enabled, representing a preference for consistency over availability.
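To illustrate how strict header validation can reject a perfectly valid header, here is a toy sketch. It is not the Blast node code: the field lists are an abbreviated, illustrative subset, the function names are hypothetical, and SHA-256 over a JSON encoding stands in for the Keccak-256 hash over the RLP-encoded header that Ethereum actually uses. The three Dencun fields shown are the ones added to the L1 header by EIP-4844 and EIP-4788.

```python
import hashlib
import json

# Illustrative subset of header fields known before Dencun (assumption).
PRE_DENCUN_FIELDS = ["parentHash", "stateRoot", "number", "timestamp"]
# Header fields introduced by Dencun (EIP-4844 and EIP-4788).
DENCUN_FIELDS = ["blobGasUsed", "excessBlobGas", "parentBeaconBlockRoot"]
ALL_FIELDS = PRE_DENCUN_FIELDS + DENCUN_FIELDS


def header_hash(header: dict, supported_fields: list) -> str:
    """Hash the fields this node version knows how to encode.

    Fields absent from the header are skipped, so a pre-Dencun header
    hashes identically on old and new node versions.
    """
    values = [header[name] for name in supported_fields if name in header]
    return hashlib.sha256(json.dumps(values).encode()).hexdigest()


def validate_header(header: dict, reported_hash: str, supported_fields: list) -> bool:
    """Strict validation: recompute the hash and compare it with the provider's."""
    return header_hash(header, supported_fields) == reported_hash


pre = {"parentHash": "0xaa", "stateRoot": "0xbb", "number": 1, "timestamp": 100}
post = dict(pre, number=2, blobGasUsed=0, excessBlobGas=0, parentBeaconBlockRoot="0xcc")

# The L1 provider always hashes the complete header.
pre_hash = header_hash(pre, ALL_FIELDS)
post_hash = header_hash(post, ALL_FIELDS)

patched_ok = validate_header(post, post_hash, ALL_FIELDS)
unpatched_ok = validate_header(post, post_hash, PRE_DENCUN_FIELDS)
```

In this model a node with complete field support accepts both pre- and post-Dencun headers, while a node missing the new fields computes a different hash for a correct post-Dencun header and rejects it. With validation enabled, that rejection halts block production rather than risking progress on data the node cannot verify.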

Root cause

When we cherry-picked the required changes for Dencun from testnet to mainnet, we missed one of the required commits. The port was code-reviewed twice: once in isolation when the Dencun changes were first merged to the mainnet branch, and again in aggregate when multiple people reviewed the entire diff between testnet and mainnet.

The change in question involves a few dozen lines across three files, and the reviewers confirmed that all three files were modified and that the majority of the required diff was present. Unfortunately, we failed to notice that one of the files was missing six key source lines of code (SLOC).
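A mechanical check can catch this kind of partial port where visual review did not. The sketch below is hypothetical and is not tooling we had in place: it assumes the reference change is available as a before/after pair of file contents, and it reports every non-blank line the reference change adds that does not appear in the ported file. The function names are illustrative.

```python
import difflib


def added_lines(before: str, after: str) -> list:
    """Return the non-blank lines that the reference change adds."""
    diff = difflib.ndiff(before.splitlines(), after.splitlines())
    return [line[2:] for line in diff if line.startswith("+ ") and line[2:].strip()]


def missing_from_port(before: str, after: str, ported: str) -> list:
    """Return added lines from the reference change that the ported file lacks.

    Lines are compared after stripping surrounding whitespace, so pure
    indentation differences between branches are not reported.
    """
    ported_lines = {line.strip() for line in ported.splitlines()}
    return [line for line in added_lines(before, after) if line.strip() not in ported_lines]


# Toy example: the reference change adds two lines, the port carries only one.
reference_before = "a\nb\n"
reference_after = "a\nx = 1\ny = 2\nb\n"
ported_file = "a\nx = 1\nb\n"

missing = missing_from_port(reference_before, reference_after, ported_file)
```

A line-level comparison like this is deliberately crude. It would flag intentional differences between the two branches as well, so its output is a list for a reviewer to inspect rather than a pass/fail gate.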

Since the Blast Testnet had been running on Dencun successfully for a while, our primary concern was actually that the new mainnet codebase wouldn’t work in a pre-Dencun environment. To this end, we deployed an additional private testnet running against Ethereum mainnet to ensure that the new code could handle both pre- and post-Dencun environments. However, we did not complete the additional verification step of running the new mainnet code against a post-Dencun L1.

Timeline

All times in UTC.

13:55 T-10m Dencun activates on Ethereum mainnet
13:56 T-9m Monitoring system detects that there’s an issue, but does not escalate yet
13:59 T-6m The issue persists long enough to trigger PagerDuty, alerting the primary on-call, who begins to investigate
14:05 T+0m Sequencer stops producing blocks
14:07 T+2m Issue escalated internally to include the secondary on-call, who begins to help
14:14 T+9m We confirm the root cause of the issue is not with the underlying L1 provider
14:32 T+27m We realize that our infrastructure automation’s safety checks for sequencer deployments are going to noticeably delay our release of the fix; we share internally that the fix ETA is 30 minutes
14:36 T+31m Finished implementing a fix and corresponding unit test
14:41 T+36m Publicly announce a 30-60min ETA for the fix: https://twitter.com/Blast_L2/status/1767924021872455902?s=20
15:10 T+65m Sequencer starts producing blocks again; we confirm public RPC/block explorer are in sync

Learnings & Changes

More redundant QA

One of the mistakes we made was assuming that because the Dencun header change had been tested in one post-Dencun environment (testnet), we didn’t need to verify it again in a different post-Dencun environment. We assumed that we couldn’t have been missing any of the Dencun header changes since we had reviewed and merged an explicit PR that added them to mainnet, so the only remaining risk was that the same Dencun code would work on Sepolia but not mainnet.

In the future, we’ll be more careful about the validation assumptions we make and more willing to perform end-to-end tests for sensitive environments, even if they seem redundant.