In the past few hours a critical security vulnerability and its patch were disclosed on Solana. This public disclosure occurred after a supermajority of stake had already been patched to protect the network. Let's look at how this process unfolded and how 70% of stake was patched before the public disclosure:
First outreach
We were first contacted by multiple members of the Solana Foundation via private message on several communication platforms. These are people we know personally and have communicated with before, so their identity was known and verified to us.
The first message was received on Wednesday, 7 August 2024 at 14:56 UTC, advising of an upcoming critical patch and sharing a hashed message confirming the date and a unique identifier for the incident. The hash shared in this message was published by multiple prominent members of Anza, Jito and the Solana Foundation on Twitter/X, GitHub and even LinkedIn in order to confirm the veracity of the message.
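For illustration, verifying such a hash commitment is straightforward once the original message is revealed: hash the message and compare it to the published value. The message format, the algorithm (SHA-256 here) and the values below are placeholders, not the actual ones used:

```python
import hashlib

# Hypothetical example: the real message format, identifier and algorithm were
# not published here, so these values are placeholders for illustration only.
committed_message = "2024-08-07 critical patch, incident id: <unique identifier>"
published_hash = "<hash posted on Twitter/X, GitHub and LinkedIn>"

digest = hashlib.sha256(committed_message.encode("utf-8")).hexdigest()
print("commitment verified" if digest == published_hash else "mismatch")
```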
The message provided a specific date and time at which to expect receipt of the patch, so it could be urgently applied to Mainnet nodes to protect the network. As the patch itself discloses the vulnerability, time is critical once it first circulates.
Follow-up
During the next 24 hours several other core members reached out to confirm readiness and reiterate the need for urgency and confidentiality.
Patch time
At the pre-determined time, 14:00 UTC yesterday, 8 August 2024, we received a further message, again from two separate Solana Foundation members, containing instructions on downloading, verifying and applying the patch. The patch itself was hosted on the GitHub repository of a known Anza engineer.
The instructions included verifying the downloaded patch files against supplied SHA checksums, and the files could also be manually inspected for their code changes.
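In practice this is the equivalent of running something like sha256sum -c against the supplied checksum file. A minimal sketch of the same check in Python, with a placeholder filename and digest:

```python
import hashlib
import sys

def sha256sum(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 digest of a file, reading it in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Placeholder values; operators would substitute the filename and the shasum
# supplied alongside the patch.
expected = "<supplied shasum>"
actual = sha256sum("patch-file.tar.gz")
if actual != expected:
    sys.exit(f"checksum mismatch: got {actual}")
print("checksum verified")
```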
At no point were operators asked to run a closed-source or private binary.
Through extensive and ongoing outreach by the Solana Foundation, Anza, Jito and others, a superminority and soon thereafter a supermajority (two-thirds of stake) was patched within several hours. Once 70% was patched the network was ostensibly safe, and the existence of the vulnerability and the patch was disclosed publicly with a call for all remaining operators to upgrade.
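The thresholds themselves are simple arithmetic: a superminority is commonly taken to mean more than a third of stake (enough to halt the network), and a supermajority more than two-thirds. A toy calculation with made-up stake figures:

```python
# Made-up figures for illustration: patched stake vs. total active stake.
patched_stake_sol = 280_000_000
total_active_stake_sol = 400_000_000

patched_share = patched_stake_sol / total_active_stake_sol
superminority_threshold = 1 / 3   # enough stake to halt the network
supermajority_threshold = 2 / 3   # enough stake to finalize blocks

print(f"patched: {patched_share:.2%}")
if patched_share > supermajority_threshold:
    print("supermajority of stake is patched")
elif patched_share > superminority_threshold:
    print("superminority patched, keep going")
```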
How do you contact validators in a decentralized network?
This question has arisen, but it's really not that complicated. Most validators are active on Discord, many are also active in various Telegram groups, we interact on Twitter/X, and we might even know Anza or Foundation employees personally from Breakpoint etc. It's tedious but not difficult to DM validators in order to pass on such messages, especially with a group of 5-8 core people all participating in this outreach.
The key is to manage to contact enough stake to protect the network while retaining confidentiality. The amazing thing about Solana's validator community is that it's very active and engaged, and even if you don't directly know a validator they're often only one degree of separation away as we've all made friends with others over the years.
The added complexity with this sort of exercise is that operators exist in all time zones; for us the patch release was at 2am. Ultimately this is the sort of thing that happens in a complex computing environment. The existence of a vulnerability is not a concern in itself; the response is what matters. The fact that this was caught and safely resolved in a timely manner speaks volumes about the ongoing high-quality engineering efforts, often not visible to the public, by Anza and Foundation engineers, but also engineers at Jump/Firedancer, Jito and all the other core contributing teams.
• • •
forking event - this was the initial tweet I and others put out, because excessive forking is what we observed. This is, however, a symptom, not a cause.
We don't know the exact root cause yet, but what we know is that over the past week all validators were encouraged to upgrade from 1.13.x to 1.14.16 which is a big jump with a lot of code changes.
We know that in the 6-12 hours prior to the start of today's event the validators running 1.14.x versions cumulatively achieved a supermajority of stake (>66.6%).
We also know that there was a "fat block" propagated around the time of the start of this event.
Interested in understanding the hardware requirements of Solana validators better?
Will break down the various components and why they matter in this thread
If you want to run a validator or RPC node reach out to @latitudesh (or direct to @guisalberto) for some good deals
🧵
There's always much talk about the hardware requirements for Solana validators.
Why are they so high? Simply put: because Solana wants to achieve the highest throughput that is technologically possible given the best computing and networking resources.
At any given moment the network is processing between 2,000 and 4,000 transactions a second, which need to be distributed, executed and committed across thousands of nodes in a matter of milliseconds.
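As a rough back-of-envelope figure (the average transaction size here is an assumption; Solana transactions are capped at 1232 bytes and many are much smaller):

```python
# Back-of-envelope estimate of raw transaction data throughput.
tps = 3_000            # within the 2,000-4,000 range above
avg_tx_bytes = 700     # assumed average serialized transaction size

bytes_per_second = tps * avg_tx_bytes
print(f"~{bytes_per_second / 1e6:.1f} MB/s of transaction data, before "
      "shredding, erasure coding and retransmission to the rest of the cluster")
```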
Thoughts on the validator that caused this outage.
🧵
Hopefully it was unintentional, and ultimately it was a code bug that didn't correctly handle the duplicate block scenario (most of the time it is handled correctly, but this was a very rare edge case).
A validator ran two instances and propagated two conflicting blocks.
This caused a fork: half the cluster saw and agreed on one block, while the other half saw the duplicates, skipped them and created a new fork.
But the majority was already building off one of the original blocks, and a code bug prevented them from switching to the winning fork.
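To make the failure mode concrete, here is a deliberately simplified toy model, nothing like the real fork-choice code, just the shape of the problem: two blocks exist for one slot, and the fork with the most stake-weighted votes should win.

```python
from collections import defaultdict

# Toy illustration only; made-up stake percentages, not Solana's actual
# fork-choice implementation.
votes = {
    "slot_100_block_A": 55,  # % of stake voting on the fork built on block A
    "slot_100_block_B": 30,  # % of stake that forked around the duplicate
}

blocks_by_slot = defaultdict(list)
for block_id in votes:
    slot = block_id.split("_block_")[0]
    blocks_by_slot[slot].append(block_id)

for slot, blocks in blocks_by_slot.items():
    if len(blocks) > 1:
        print(f"{slot}: duplicate blocks detected: {sorted(blocks)}")
    winner = max(blocks, key=votes.get)
    print(f"{slot}: heaviest fork is on {winner} ({votes[winner]}% of stake)")
    # The bug described above: some validators that had already built on the
    # other block could not switch over to this heaviest fork.
```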
The Solana Mainnet cluster successfully restarted a few minutes ago; new blocks are being finalized.
So what exactly happened? We don't know everything yet, but here's some of what we do know
🧵
At approximately 22:41 UTC on 30 September, many validators stopped voting.
Many operators were immediately alerted and converged on Discord to assess the situation.
It was not immediately clear what was happening; it appeared the network had halted, but there were also signs it hadn't.
Some conflicting reports indicated new roots were still being made (a root is a slot that is finalized, i.e. it has been voted upon by a supermajority, and so have the 32 slots after it).
New roots = confirmed blocks, transactions being processed.
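One quick way to check this from the outside is to poll the finalized (rooted) slot via Solana's public JSON RPC and see whether it advances; the sketch below uses the standard getSlot method against the public mainnet endpoint:

```python
import json
import time
import urllib.request

# Poll the finalized (rooted) slot twice; if it advances, new roots are being
# made. Uses the public JSON RPC getSlot method with "finalized" commitment.
RPC_URL = "https://api.mainnet-beta.solana.com"

def finalized_slot() -> int:
    payload = json.dumps({
        "jsonrpc": "2.0",
        "id": 1,
        "method": "getSlot",
        "params": [{"commitment": "finalized"}],
    }).encode()
    req = urllib.request.Request(
        RPC_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["result"]

first = finalized_slot()
time.sleep(10)
second = finalized_slot()
print("roots are advancing" if second > first else "no new roots in the last 10s")
```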
Go look at their APY: we calculate a TrueAPY by excluding undelegated amounts such as rent and by using the accurate epoch duration.
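The general idea, simplified and with made-up numbers (this is a sketch, not the exact TrueAPY formula): measure rewards against the actively delegated stake only, and compound using the observed epoch length rather than a nominal one.

```python
# Simplified sketch, not the exact TrueAPY formula; figures are made up.
epoch_reward_sol = 40.0        # rewards earned by the stake account this epoch
active_stake_sol = 100_000.0   # actively delegated stake, rent excluded
epoch_duration_hours = 52.0    # observed epoch length, not a nominal value

epochs_per_year = 365 * 24 / epoch_duration_hours
rate_per_epoch = epoch_reward_sol / active_stake_sol
apy = (1 + rate_per_epoch) ** epochs_per_year - 1
print(f"APY: {apy:.2%}")
```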
Not brand new, but don't forget you can use Stakewiz to manage all your stake accounts, including delegating, undelegating and closing them, and reclaiming your stake and rent!