Use DAS to speed up consensus throughput

One of the core insights of Celestia is that DAS'ing (data availability sampling) enough to reconstruct the data is much cheaper than gossiping the full data around. This realization should be brought into consensus itself. This post sketches an outline of using DAS + direct p2p communication to lift global consensus across 100 uniform validators to ~1 GB/s of DA throughput, using (say) ~115 MB/s of bandwidth per validator.

This is a hasty writeup to get the idea out there, but the overarching result should be a massive increase in throughput. It gets Celestia consensus into a world where increasing the validator count increases throughput. I'm sure there's some p2p complexity relevant to the parameterization that's missed here.

It also assumes we are already in a world where it's possible for validators to produce, at low latency, a proof that they have constructed a valid codeword (e.g. just using FRI directly). It also assumes validators can communicate directly with other validators (instead of flood gossip). It may need GPUs on validators, but no cost increase for light nodes or full nodes.

What should be done imo is to partition our validators such that:

  • DA roots have 2 steps to finality.
  • We have (say) 10 subsets of validators. Each runs Tendermint as it exists today to reach 100 MB/s of data throughput, and publishes a DA root along with a FRI proof that the root commits to a valid erasure code.
  • Each subset of validators publishes its view of its DA root via vote extensions.
  • After a DA root is published by one validator subset, the other 9 subsets DAS-sample it.
  • Upon receiving samples for another DA root, a validator communicates this to the entire Celestia network (via vote extensions).
  • Once we know that, under standard 33% BFT assumptions, the entire Celestia validator set can reconstruct any “finalized DA root”, we consider that root final. In other words, there is no 1/3rd byzantine subset that can convince you the network has the data when it actually cannot reconstruct it. (A minimal sketch of this check follows the list.)
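
A minimal sketch of that finality check, assuming 100 equal-stake validators, one unique share per validator, and that each vote extension simply lists the DA roots whose shares the voter holds. The names (SharesVote, is_final) are hypothetical, and the reconstruction fraction is really set by the erasure-code rate, which the parameterization below leaves loose:

```python
# Hypothetical sketch, not a spec: the "no 1/3 byzantine subset can fake availability" rule.
from dataclasses import dataclass

NUM_VALIDATORS = 100   # equal-stake validators, per the parameterization below

@dataclass(frozen=True)
class SharesVote:
    """A validator's vote extension: which candidate DA roots it holds its shares for."""
    validator: int
    roots_held: frozenset

def is_final(root: bytes, votes: list[SharesVote],
             reconstruct_fraction: float = 2 / 3) -> bool:
    """Final iff, even when a worst-case 1/3 of the validator set is byzantine and
    lying about holding shares, the remaining honest attesters still hold enough
    unique shares to reconstruct the data behind `root`.
    reconstruct_fraction = 2/3 corresponds to a rate-2/3 code (1.5x expansion)."""
    attesters = {v.validator for v in votes if root in v.roots_held}
    worst_case_honest = len(attesters) - NUM_VALIDATORS // 3
    return worst_case_honest >= reconstruct_fraction * NUM_VALIDATORS
```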

So what should happen, in a pipeline, is:

  • At the beginning of block N, each 10% subset publishes a DA root and gets it included as a candidate via vote extensions.
  • During block N voting time / block N+1 proposal time, every validator DAS-samples the other subsets' data.
  • In block N+1's votes, they indicate which DA roots they have received their sampled data for. Upon a sufficient threshold of votes arriving, those roots are made final. (A rough sketch of this pipeline follows.)
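
A rough sketch of that pipeline in Python (DAPipeline and its method names are hypothetical; finality_check would be something like the is_final predicate sketched earlier):

```python
# Hypothetical pipeline sketch: candidate DA roots enter at height N via vote
# extensions and are finalized at height N+1 once enough share attestations arrive.

class DAPipeline:
    def __init__(self):
        self.candidates = {}    # height -> set of candidate DA roots
        self.finalized = set()  # DA roots considered final

    def on_block(self, height, new_candidate_roots, shares_votes, finality_check):
        """new_candidate_roots: DA roots published by the subsets at this height.
        shares_votes: vote extensions in this height's votes, attesting to shares
        of the previous height's candidates.
        finality_check: e.g. the is_final() predicate sketched earlier."""
        # Step 1 (block N): register this height's candidate roots.
        self.candidates[height] = set(new_candidate_roots)

        # Step 2 (block N+1): roots proposed last height have now been DAS-sampled
        # during N's voting / N+1's proposal time; finalize those with enough votes.
        for root in self.candidates.get(height - 1, set()):
            if root not in self.finalized and finality_check(root, shares_votes):
                self.finalized.add(root)
```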

How much can we improve throughput? Let's start parameterizing. Let's say each 10% subset requires 90% of its nodes to agree on the data (from a single proposer). Let's also assume, for now, that the erasure code rate is low enough to give every peer a unique element of the erasure code. (This is equivalent to assuming CPU is not our bottleneck, as erasure encoding is fully parallelizable via GPU for our validators.) Each subset's DA root is over SUBSET_BLOCK_SIZE bytes of true data.

I'm going to parameterize loosely for the sake of posting this. You can tighten the reasoning in several ways and get better throughput-improvement bounds (though I'm also rounding off some terms, which may matter). I'll also assume equal-stake validators to make the point.

The easiest way to reason about this 2/3rds dissemination is to assume that your honest 2/3rds does not intersect with the publishing committee. So we need N bytes (i.e. SUBSET_BLOCK_SIZE) sent to 2/3rds of validators. (Note every validator is receiving data; the subset is net sending 1.5 N bytes.) This means each validator's data overhead increases by (SUBSET_BLOCK_SIZE * (3/2) / NUM_VALIDATORS) + DA_ROOT_AUTH_DATA per subset, except for the subset whose committee they are part of. The DA_ROOT_AUTH_DATA per validator can be made quite small, e.g. just two Merkle tree paths for a contiguous range (in the world where every validator gets independent shares). When we get to higher “share re-use” thresholds, we start picking up overheads here. We may need to use KZG (which necessitates a GPU on just validators) to get this proof time back to negligible, once we account for erasure-encoding time overheads.

So this is effectively a total additional bandwidth overhead, per validator, of

(NUM_SUBSETS - 1) * (SUBSET_BLOCK_SIZE * (3/2) / NUM_VALIDATORS)

At 100 equal-weight validators, 10 subsets, and 100 MB/s of throughput per subset, each validator needs 13.5 MB/s of additional DA sampling data, for a total of 10 subsets * 100 MB/s per subset = 1 GB/s of DA throughput.
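
A back-of-the-envelope check of those numbers (ignoring the DA_ROOT_AUTH_DATA term, as the formula above does):

```python
# Back-of-the-envelope check of the 13.5 MB/s and 1 GB/s figures above.
NUM_VALIDATORS    = 100
NUM_SUBSETS       = 10
SUBSET_THROUGHPUT = 100e6   # bytes/s of true data per subset (100 MB/s)
EXPANSION         = 3 / 2   # each subset net sends 1.5x its data

# Per-validator sampling overhead: shares received from every subset except your own.
per_validator_overhead = (NUM_SUBSETS - 1) * (SUBSET_THROUGHPUT * EXPANSION / NUM_VALIDATORS)
print(per_validator_overhead / 1e6)                         # 13.5 MB/s

# Total DA throughput across all subsets.
print(NUM_SUBSETS * SUBSET_THROUGHPUT / 1e9)                # 1.0 GB/s

# Total per-validator bandwidth: your own subset's data plus the sampling overhead.
print((SUBSET_THROUGHPUT + per_validator_overhead) / 1e6)   # 113.5 MB/s
```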

So this is really saying: “with 113.5 MB/s of communication throughput per validator, one more latency step, and some structured network gossip (direct peering), here is a way to achieve 1 GB/s of DA throughput.” If we already assume more structured network gossip, you can get far better-sounding results here by making each subset more geo-local and actually sequencing more data.

There is a load-bearing assumption about any given subset's ability to gossip data to the rest of the network. This is why I'd imagine a SUBSET_SIZE of 1 validator wouldn't work well.


In hindsight, this is actually much easier to instantiate the whole way through using KZG.

KZG already acts as a “rateless code”, making it very easy to get the property that I can give every validator a distinct share. The commitment is also itself a proof that the data is a true erasure code. I can also prove a KZG opening of N points to one validator with constant bandwidth.
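
A toy illustration of that “rateless” property (this is plain Reed-Solomon-style polynomial evaluation over a small prime field, not real KZG; a real instantiation would use a KZG library so the commitment also proves low degree and openings are constant-size):

```python
# Toy illustration: data committed as a degree < k polynomial can be evaluated at
# arbitrarily many distinct points, one per validator, and any k evaluations suffice
# to reconstruct.
P = 2**61 - 1  # toy prime field; real KZG works over a pairing-friendly curve's field

def evaluate(coeffs, x):
    """Horner evaluation of the data polynomial at point x."""
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * x + c) % P
    return acc

def share_for_validator(coeffs, validator_index):
    """Validator i gets the evaluation at point i+1: every validator has a distinct share."""
    return evaluate(coeffs, validator_index + 1)

def reconstruct_at_zero(points):
    """Lagrange-interpolate f(0) from any k (x, y) shares; f(0) is just the demo target."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, P - 2, P)) % P
    return total

data = [7, 13, 42, 99]                                               # k = 4 field elements of "block data"
shares = [(i + 1, share_for_validator(data, i)) for i in range(10)]  # 10 distinct shares
assert reconstruct_at_zero(shares[3:7]) == data[0]                   # any 4 shares recover f(0)
```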

Using this would require either changing light nodes' DA sampling to KZG, or one entity having to re-download all the data and re-encode it for light nodes. The latter is not too extreme, since you can have a pipeline of this happening across different mainline Celestia proposers, working on a slower timescale (e.g. 5-30s).


I like this flavor of “external” sharding!

The thing that has always puzzled me about similar designs without sharding is figuring out how uploading the erasure-encoded block increases the max theoretical throughput when the theoretical bottleneck is the proposer's upload. For example, in Solana's Turbine (a different use of erasure encoding ofc, but it still applies), the maximum throughput is the proposer's bandwidth divided by 3. Same thing with VID shares or danksharding. Optimizing for the non-proposer/builder/disperser is nice, but Celestia doesn't have to or want to do that; it wants to utilize all validators at all times.
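
A quick illustration of that cap versus the sub-committee design, assuming (purely for illustration) 300 MB/s of upload bandwidth for any single proposer; the divide-by-3 is the Turbine-style bound cited above:

```python
# Illustrative only: a single proposer's upload caps throughput; per-subset proposers don't.
PROPOSER_UPLOAD = 300e6   # assumed upload bandwidth of any single proposer, bytes/s

# Single global proposer disseminating the erasure-coded block itself:
single_proposer_throughput = PROPOSER_UPLOAD / 3       # the "bandwidth / 3" bound cited above

# Sharded design: each of the 10 subsets is bounded by its *own* proposer's upload.
NUM_SUBSETS = 10
print(single_proposer_throughput / 1e6)                # 100 MB/s
print(NUM_SUBSETS * single_proposer_throughput / 1e6)  # 1000 MB/s aggregate
```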

With this flavor of sharding, I can see how we are able to get past this using subcommittees!

One thing that I haven't figured out yet is how to get sampling for light clients in an efficient way. It seems like they'd either have to sample each subcommittee, or the main committee would have to create a square with new erasure encoding. Naively, the first requires a ~10x overhead increase for light clients, while the second requires creating more erasure data and some very clever sampling-gossip magic to avoid the main committee having to download each other's shares. Not sure on this yet though, there might be a simple way to do this.

One of the main benefits of validators downloading all of the data is that sampling is easy to distribute across the entire set instead of centralizing it.


Coming at the problem from a different angle, the main challenge in my head is coordination of sub-committees. You want the system to be elastic, i.e., the ability to add more sub-committees to scale throughput.

Maybe this is out of scope for the 1st design, and I'm missing some parts here about light client verification / DAS assumptions, but some thoughts…

The shared state here that we need to coordinate on is the global DA root, the set of all sub-committees and the partitioning of nodes into each sub-committee, and maybe the protocol version and system parameters. We can use the set of all validators across all sub-committees (say the 100 nodes) as the “coordinator machine” to manage this state. The challenge with this is that it's sub-optimal bandwidth-wise and inflexible.

If we assume each sub-committee to be a correct BFT instance, we can allow each instance to progress at its own rate, completely unaware of the other sub-committees. Each sub-committee is essentially a “DA provider”. The coordinator that puts all the DA providers together is a separate BFT instance that solely does DAS on the providers' DA roots. Validators in this coordinator use vote extensions (VE) to certify DA roots, each root corresponding to a DA provider block.

The core Celestia network of validators is this coordinator instance. To scale up, you bring in more DA provider instances, so the design is more flexible. It's also intuitively better bandwidth use, because the core network no longer has to care about shipping high volumes of data; it just certifies via DAS that all DA provider networks have done their job. Each block in the core/coordinator network is a mix of DA roots (from various providers) that have been certified. Some DA providers may move faster, and in that case a single core block may include multiple DA roots from the same provider.
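
A hypothetical sketch of that split, just to make the flow concrete (all names are made up): each DA provider runs its own BFT instance at its own pace, while the coordinator only tracks, certifies via DAS + vote extensions, and batches their roots into core blocks.

```python
from dataclasses import dataclass, field

@dataclass
class ProviderRoot:
    provider_id: str        # which DA provider (sub-committee) produced this root
    provider_height: int    # that provider's own block height
    da_root: bytes

@dataclass
class Coordinator:
    """The core Celestia network: it never ships blob data, it only certifies roots."""
    pending: list = field(default_factory=list)    # announced but not yet DAS-certified
    certified: list = field(default_factory=list)  # DAS + VE quorum reached, awaiting inclusion

    def on_provider_root(self, root: ProviderRoot):
        self.pending.append(root)

    def on_das_quorum(self, root: ProviderRoot):
        """Called once enough coordinator validators have certified this root via VE."""
        self.pending.remove(root)
        self.certified.append(root)

    def build_core_block(self) -> list:
        """A core block is the batch of roots certified since the last block; a fast
        provider may contribute several roots to a single core block."""
        block, self.certified = self.certified, []
        return block
```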

To post blobs, a user either posts directly to one of the providers, or goes via the core network, which redirects the post request to a DA provider instance. Data reconstruction becomes more complex in this design, and this may be the biggest trade-off here.
