Summary

On March 22, 2024, the Blast Mainnet sequencer stopped producing blocks from 01:20 to 01:29 UTC. The root cause was correlated failures across multiple L1 RPC providers, preventing the sequencer from fetching the required L1 data to build new blocks.

Background

In order for the sequencer to build a new L2 block, it needs to fetch certain information from the L1, such as the deposit transactions that should be included in the next block. Generally speaking, if the sequencer is unable to receive up-to-date L1 responses, it will continue to produce L2 blocks for up to 10 minutes. After those 10 minutes, it halts block production.
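To make this behavior concrete, here is a minimal sketch of the staleness tolerance described above. The class and method names are hypothetical, not Blast's actual code; the only detail taken from the post is the 10-minute window.

```python
MAX_L1_STALENESS = 10 * 60  # seconds; the ~10-minute tolerance described above

class Sequencer:
    """Illustrative sketch: track when L1 data was last fetched successfully."""

    def __init__(self):
        self.last_l1_data = None
        self.last_l1_fetch_time = None

    def on_l1_fetch(self, data, now):
        # Record a successful fetch of L1 data (e.g. deposit transactions).
        self.last_l1_data = data
        self.last_l1_fetch_time = now

    def can_build_block(self, now):
        # Keep producing L2 blocks as long as the cached L1 data is
        # fresh enough; halt once it is older than the tolerance.
        if self.last_l1_fetch_time is None:
            return False
        return (now - self.last_l1_fetch_time) <= MAX_L1_STALENESS
```

In this model, a short provider outage (like the first three-minute incident) never interrupts block production, because the cached L1 data stays within the window.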

For improved availability, the Blast Mainnet sequencer uses multiple L1 RPC providers behind the scenes. When the primary provider returns too many errors, takes too long to respond, or stops syncing new blocks, the sequencer can fail over to a different L1 RPC provider to stay online. The set of providers serving the Blast sequencer has changed over time for various reasons, but at the time of this incident, Tenderly and Alchemy were the two candidates.
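The failover decision described above can be sketched as a health check over the three signals the post mentions: error rate, latency, and sync lag. The thresholds and names here are illustrative assumptions, not Blast's actual configuration.

```python
from dataclasses import dataclass

# Illustrative thresholds; the real values are not stated in the post.
MAX_ERROR_RATE = 0.5     # fraction of recent RPC calls allowed to fail
MAX_LATENCY_S = 5.0      # maximum acceptable response time
MAX_BLOCK_LAG = 3        # how far behind the L1 head a provider may be

@dataclass
class ProviderHealth:
    name: str
    error_rate: float        # "throws too many errors"
    p95_latency_s: float     # "takes too long to respond"
    blocks_behind_head: int  # "stops syncing new blocks"

    def healthy(self) -> bool:
        return (self.error_rate <= MAX_ERROR_RATE
                and self.p95_latency_s <= MAX_LATENCY_S
                and self.blocks_behind_head <= MAX_BLOCK_LAG)

def pick_provider(primary, fallbacks):
    # Prefer the primary; otherwise fail over to the first healthy fallback.
    for p in [primary] + fallbacks:
        if p.healthy():
            return p
    return None  # correlated failure: no healthy provider remains
```

The `None` case is exactly the scenario this incident describes: with only two providers in the rotation, a correlated failure leaves the sequencer with nothing to fail over to.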

About 90 minutes before this incident, at around 23:54 UTC, we observed correlated failures across Tenderly and Alchemy for a period of about three minutes. During this period, Tenderly had stopped syncing new blocks, and Alchemy was responding with HTTP 500s to one of the required RPC calls. We received an alert for this earlier incident at 23:55 UTC and began investigating immediately, but the issue resolved itself a minute or two later. Because the L1 provider downtime was less than 10 minutes, the sequencer was still able to produce blocks throughout this period.

Root cause

Unfortunately, the same issue from 23:54 UTC recurred less than 90 minutes later. From 01:20 to 01:29 UTC, Tenderly stopped syncing new blocks and Alchemy started returning HTTP 500s for some of their responses.

We began investigating this incident at 01:23 UTC and were surprised to find that although both L1 providers had been down for only three minutes, the sequencer had already stopped producing blocks. This made us worry that a deeper issue was at play. So instead of quickly swapping providers and restoring service within a couple of minutes, a process we've practiced and timed in the past, we continued our investigation to properly assess the situation.

By 01:29 UTC, we confirmed the underlying issue was purely L1 provider related and started a pipeline to swap to a third L1 RPC provider. Coincidentally, the original issue with the upstream providers also resolved itself around this time.

On further inspection, we realized that due to an idiosyncratic error pattern (specific types of RPC calls consistently succeeding while others consistently failed), the sequencer entered a code path that prevented it from using old L1 blocks to build new L2 blocks. Had the errors occurred with a bit more randomness, there would not have been an issue: block production would have continued for another 10 minutes using old L1 data. Because of this specific and strange partial failure mode, however, block production stalled instead.
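One way such a partial failure can bypass a stale-data fallback is if the fallback only triggers when the first L1 call fails. The sketch below is a hypothetical illustration of that shape of bug, not the sequencer's actual code path: fetching the L1 head succeeds, fetching the associated deposit data consistently fails, and the code neither uses fresh data nor falls back to the cache.

```python
class RPCError(Exception):
    pass

class BlockProductionStalled(Exception):
    pass

class L1Cache:
    """Last successfully fetched L1 data (hypothetical)."""
    def __init__(self, head, receipts):
        self.head, self.receipts = head, receipts
    def update(self, head, receipts):
        self.head, self.receipts = head, receipts

def block_from(head, receipts):
    return {"l1_origin": head, "deposits": receipts}

def build_block(fetch_head, fetch_receipts, cache):
    try:
        head = fetch_head()
    except RPCError:
        # Full failure: fall back to the cached (older) L1 data.
        return block_from(cache.head, cache.receipts)
    try:
        receipts = fetch_receipts(head)
    except RPCError:
        # Partial failure: the head call succeeded, so the fallback above
        # was never taken, but the data needed to build the block is
        # missing. Block production stalls instead of using old L1 data.
        raise BlockProductionStalled(head)
    cache.update(head, receipts)
    return block_from(head, receipts)
```

If both calls fail (random, uncorrelated errors), the cached fallback works and blocks keep flowing; it is precisely the consistent success-then-failure pattern that hits the stalling path.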

Learnings & Changes

Decorrelating L1 provider failures

Firstly, we’ve added a third provider to the rotation to reduce the chance of this happening again in the short term.

Longer term, we’ve been actively discussing L1 provider stability internally for the last two weeks. Although improvements in this area were already a priority, the incident today highlighted the importance of planning for correlated provider failures. Moving forward, we’ll consider things like client diversity, cloud provider, and system architecture more carefully when selecting which providers we include in our list of L1 backends.

Sequencer should be more robust to L1 error patterns

We’re going to continue to investigate the specific error behavior that caused the sequencer to prematurely stop block production. With some minor tweaks to the L1 provider error handling, it should be possible to improve the sequencer’s robustness in the face of this specific L1 error pattern.