CIP: Coordinated Network Upgrades

cmwaters · 7 December 2023 21:36

This post is to discuss the CIP on coordinated network upgrades which can be found here: https://github.com/celestiaorg/CIPs/blob/main/cips/cip-10.md

STB · 8 December 2023 18:37

Thanks @cmwaters for the CIP!

Mind elaborating whether the 5/6 quorum is sufficient if observed at a certain height before the upgrade height? For example, if at height x < upgrade_height the quorum is achieved, but at a later height y < upgrade_height the quorum is not maintained (due to voting power change), does the upgrade still proceed as planned at upgrade_height?

Also, can the upgrade occur at an earlier height z (z < upgrade_height) upon receiving a MsgTryUpgrade for that height z?

evan · 8 December 2023 19:55

I’d defer to Callum @cmwaters for final confirmation, but

does the upgrade still proceed as planned at upgrade_height?

Good question! If the voting power changes after 5/6 have signalled, and there is no longer a 5/6 quorum, then the upgrade will no longer be able to be triggered. The logic can be found on this line.

Also, can the upgrade occur at an earlier height z (z < upgrade_height) upon receiving a MsgTryUpgrade for that height z?

What do we mean by z here? I’m unaware of any minimum upgrade height, but could definitely be missing something. AFAIU, the upgrade can occur at any height provided there is sufficient signal.

STB · 8 December 2023 21:31

Good question! If the voting power changes after 5/6 have signalled, and there is no longer a 5/6 quorum, then the upgrade will no longer be able to be triggered. The logic can be found on this line.

Thanks a lot, Evan, for shedding light on this.

Also, can the upgrade occur at an earlier height z (z < upgrade_height) upon receiving a MsgTryUpgrade for that height z?

What do we mean by z here? I’m unaware of any minimum upgrade height, but could definitely be missing something. AFAIU, the upgrade can occur at any height provided there is sufficient signal.

My question applies to the next version where it appears that the upgrade height is predetermined. Will there be a signaling protocol in place for this new version? Specifically, what would occur if a MsgTryUpgrade is received at an earlier height, let’s call it ‘z’, and a quorum for the next version is confirmed, but ‘z’ is less than the set upgrade_height? In this scenario, does the upgrade process have to wait until the chain reaches the predetermined upgrade_height?

evan · 8 December 2023 22:42

Will there be a signaling protocol in place for this new version?

Ahh I see! No, there will not be an on-chain protocol in place. All signalling and height determination will be completely off-chain.

cmwaters · 11 December 2023 16:32

The set upgrade_height is specifically for v2. If a user sets that flag while the network is on v2, it will simply be ignored. In v3, the flag will be removed. It’s also not possible to downgrade in any case.

The messages MsgSignalVersion and MsgTryUpgrade are only recognised in v2, not in v1, so submitting them is not possible (or, more accurately, they are ignored if submitted).

STB · 11 December 2023 16:35

Appreciate the explanation @evan and @cmwaters very helpful!

musalbas · 11 December 2023 19:49

So the idea is that at v3, upgrade height isn’t fixed but will happen immediately when there’s a crank transaction right?

cmwaters · 11 December 2023 21:11

Yea correct. There will likely be some daemon tool that just monitors the chain for quorum then at that moment submits the crank message to upgrade the chain
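
A minimal sketch of what such a daemon loop might look like. No tool is named in this thread, so `query_tally` and `submit_try_upgrade` are hypothetical callables standing in for whatever chain client is used, and the 5/6 check with `>=` is an assumption about how the threshold is applied.

```python
import time


def watch_and_crank(query_tally, submit_try_upgrade,
                    poll_seconds=6.0, max_polls=None, sleep=time.sleep):
    """Poll the chain until 5/6 of voting power has signalled, then crank.

    query_tally        -- hypothetical callable returning (signalled_power, total_power)
    submit_try_upgrade -- hypothetical callable that broadcasts the crank message
    Returns True if the crank message was submitted, False if max_polls ran out.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        signalled, total = query_tally()
        # Integer comparison avoids floating point: signalled / total >= 5 / 6.
        if total > 0 and 6 * signalled >= 5 * total:
            submit_try_upgrade()
            return True
        polls += 1
        sleep(poll_seconds)
    return False
```

Because submission is permissionless, any number of parties could run a loop like this independently; only the first successful crank matters.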

musalbas · 12 December 2023 07:51

Got it. What’s the rationale for having a crank message, compared to, e.g. automatically running the upgrade logic once 5/6 have signalled?

cmwaters · 12 December 2023 09:55

It’s done as a gas optimisation. We have to tally the votes according to the voting power of each validator to understand whether 5/6ths have signalled. Rather than doing this calculation every height (or every n heights), it’s more efficient from a computation perspective if a single node does the calculation and then submits the message and pays for the gas (generally I think we should reduce usage of EndBlock because there’s no one paying for that computation).

You might ask, why not track the diffs in voting power? This is difficult when you are combining that with validators changing the version they are signalling for (you basically need to keep track of each validator’s voting power). You also need to know whether they are in the set or have been kicked out. So while there are some hooks available, it really comes down to legacy implementation that makes this not viable. Plus, given that voting power changes almost every height, it might be more computationally intensive to track voting power diffs than to do a one-off tally.
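
The one-off tally described above can be sketched as follows. This is an illustration under stated assumptions, not the celestia-app implementation: the function names, the dictionary shapes, and the use of `>=` against 5/6 are all hypothetical.

```python
from fractions import Fraction

THRESHOLD = Fraction(5, 6)  # quorum discussed in this thread


def tally(voting_power, signals):
    """Sum voting power per signalled version.

    voting_power -- {validator: power} for the current active set
    signals      -- {validator: version} as last signalled
    """
    totals = {}
    for validator, version in signals.items():
        power = voting_power.get(validator)
        if power is None:
            continue  # no longer in the active set, so the signal does not count
        totals[version] = totals.get(version, 0) + power
    return totals


def try_upgrade(voting_power, signals, current_version):
    """Return the version to run after a crank message is processed."""
    total = sum(voting_power.values())
    for version, power in tally(voting_power, signals).items():
        if version > current_version and Fraction(power, total) >= THRESHOLD:
            return version
    return current_version  # quorum not met at this height: nothing happens


signals = {"a": 2, "b": 2}
print(try_upgrade({"a": 50, "b": 34, "c": 16}, signals, 1))  # 84/100 signalled -> 2
print(try_upgrade({"a": 50, "b": 33, "c": 17}, signals, 1))  # 83/100 signalled -> 1
```

The second call shows the case raised earlier in the thread: the same signals no longer reach quorum once voting power shifts, so a crank at that height does not trigger the upgrade.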

HoytRen · 14 December 2023 08:51

I don’t like this optimization. I believe some costs for these important events are necessary, and I believe the chain should be maintained by its client binary only without any centralized third party when it’s possible and not ridiculous. The nodes in consensus always know who is in the active set, then it’s not necessary to have an external tracker to lower the cost.

In my opinion, the upgrade process should be simple:
An operator upgrades the binary of his node; when the node starts, it whispers a message saying it is ready to upgrade the network at height X. At the same time it gathers the whispers from the network to see whether there is a quorum. When the node gets enough signatures before X (I also don’t understand why the upgrade process is relative to the voting power), it schedules the network upgrade. When the height reaches X, the node checks whether the condition is still satisfied before taking the final action.

@evan, I believe the logic should be improved. For example, we currently have 100 validators, so 5/6 is about 85; when there are 85 signatures, the node should schedule the upgrade. But the action should not be canceled when the number of active signed validators drops to 84. We must pick a lower value, 80 for example, to cancel the action. There must be a gap or your system will never be stable.

@cmwaters, I also feel the text of the CIP needs to be improved. For example, “Once a quorum of 5/6 has signalled the same version, the network will migrate to that version.” sounds misleading (migrate immediately?).
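
The gap suggested here is a form of hysteresis. A minimal sketch using the numbers from this post (schedule at 85 signatures, cancel only below 80); the class and method names are hypothetical, and treating "80" as "cancel when the count falls below 80" is an assumption.

```python
class UpgradeSchedule:
    """Two-threshold (hysteresis) scheduling of an upgrade."""

    def __init__(self, schedule_at=85, cancel_below=80):
        self.schedule_at = schedule_at
        self.cancel_below = cancel_below
        self.scheduled = False

    def observe(self, signalling):
        """Update state from the current number of signalling validators."""
        if not self.scheduled and signalling >= self.schedule_at:
            self.scheduled = True
        elif self.scheduled and signalling < self.cancel_below:
            self.scheduled = False
        return self.scheduled


schedule = UpgradeSchedule()
history = [schedule.observe(n) for n in [84, 85, 84, 80, 79, 84]]
print(history)  # [False, True, True, True, False, False]
```

A drop from 85 to 84 leaves the upgrade scheduled, and once cancelled it takes 85 signatures again to reschedule.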

cmwaters · 14 December 2023 15:57

Submission of this message is permissionless. There does not need to be reliance on a third party. Anyone can do it, including the validators if they wish.

The problem with having the validators read the tally and if it goes over the threshold proposing a block with the next version, is that the upgrade really needs to be agreed upon the height before so any migrations between height h and height h + 1 can occur. Also any p2p based system makes it easy to equivocate. Having it on-chain adds visibility and thus accountability to the actions of the validators.

There must be a gap or your system will never be stable.

Say there are 5/6 signalling and the network upgrades, if the number then were to fall below the quorum, the chain wouldn’t downgrade. In fact, at the moment, it can’t downgrade. And only 2/3 are required to commit the following block so there is already a tolerance of 1/6.
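
The tolerance mentioned above is just the difference between the two thresholds. A small sketch of the arithmetic, assuming shares of total voting power and assuming the commit rule is "strictly more than 2/3", as in Tendermint-style BFT consensus; the function name is hypothetical.

```python
from fractions import Fraction

QUORUM = Fraction(5, 6)  # share that must signal before the upgrade
COMMIT = Fraction(2, 3)  # share that must be exceeded to commit a block

slack = QUORUM - COMMIT
print(slack)  # 1/6


def can_commit(upgraded_share, unavailable_share):
    """True if the upgraded validators that remain online can still commit."""
    return upgraded_share - unavailable_share > COMMIT


print(can_commit(QUORUM, Fraction(1, 10)))  # True
print(can_commit(QUORUM, Fraction(1, 5)))   # False
```

Under the strict inequality, the chain keeps committing as long as strictly less than 1/6 of voting power that signalled fails to follow through.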

I also feel the text of the CIP needs to be improved

Happy to adjust it. Is it just that sentence that you felt was misleading?

HoytRen · 15 December 2023 09:21

I read the CIP again, and I still have questions.

I agree that visualizing the progress is important, but most users will rely on an explorer, and anyone could query their light node about the state if they don’t trust the explorer (we just need an API). On the other hand, whispering should have signatures too, so the message may not be necessary. If you believe an on-chain message is necessary, I will not argue that, as it’s not expensive. But I disagree with how these messages work. If the validators could signal the current version and the result is determined by voting power, this in fact becomes a voting governance, and I believe this should be done by social consensus.

Weirdly, we could upgrade the network without upgrading the binary of the client, so a special message to notify the proposal of the height of the upgrade seems unnecessary. I’m thinking about a proposal to put this key information on-chain too, but I don’t think it’s urgent because social consensus works well so far. It is almost like a documentary for me, and there should be a unified form for all, not one for upgrading the network, one for activating a CIP, etc.

I believe the upgrade is some sort of fork, so voting power means nothing here, as long as enough validators agree with the upgrade by social consensus, the action should be taken. If we can’t reach 2/3 of the voting power in the new fork, it means the consensus failed, then why do we allow a failed chain to continue? The one cause this situation must be slashed by social consensus so that the chain keeps going.

We will not upgrade the network frequently, and the binary upgrade is always necessary, then the skip version problem should not exist practically. This is generally a QA problem of version management. Traditional software suppliers like SAP supported their customers with tens of different versions without any problem 10 years ago, then I don’t think we will have a problem today.

Let’s check the text issues later and focus on the logic for now.

evan · 17 December 2023 23:35

we could upgrade the network without upgrading the binary of the client

… and the binary upgrade is always necessary …

I’m confused by these statements. The software must change in order for the upgrade to occur, no? Are you saying that we have to change the binary or that we don’t have to? perhaps something else?

If the validators could signal the current version and the result is determined by voting power, this in fact becomes a voting governance, and I believe this should be done by social consensus.

If we can’t reach 2/3 of the voting power in the new fork, it means the consensus failed, then why do we allow a failed chain to continue? The one cause this situation must be slashed by social consensus so that the chain keeps going.

The point of the signalling mechanism is to avoid accidentally getting into situations where the chain halts, to stop a single validator from being able to easily trigger that situation, and to avoid having to pick an upgrade height. If the validators don’t follow social consensus, then they will still be slashed socially. The most important thing is the social contract, which is still very much to abide by social consensus and not token voting. Since the version is committed to and checked upon the verification of each header, there is no way for validators to upgrade without full nodes and light clients also seeing the version change. This means that if light clients see a version they don’t expect or agree with, they will halt, and the chain from their perspective does not continue. That would be the point where social slashing would occur.
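
The halting behaviour described here can be illustrated with a toy header follower. This is only a sketch: headers are reduced to hypothetical `(height, version)` pairs, and real header verification checks far more than the version.

```python
def follow_chain(headers, accepted_versions):
    """Return the last height this client accepts before halting.

    headers           -- iterable of (height, version) pairs, in chain order
    accepted_versions -- versions this client expects or agrees with
    """
    last_accepted = None
    for height, version in headers:
        if version not in accepted_versions:
            break  # unexpected version: the chain stops here for this client
        last_accepted = height
    return last_accepted


headers = [(10, 1), (11, 1), (12, 2), (13, 2)]
print(follow_chain(headers, {1}))     # 11
print(follow_chain(headers, {1, 2}))  # 13
```

A client that only accepts version 1 stops at height 11, which is the point at which the social process would take over.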

We will not upgrade the network frequently, and the binary upgrade is always necessary, then the skip version problem should not exist practically.

I might be misunderstanding a portion of this argument as I’m not sure what is meant by “skip version problem”. Do you mean that we don’t need to increment the version or commit to it in the header?

This is generally a QA problem of version management. Traditional software suppliers like SAP supported their customers with tens of different versions without any problem 10 years ago, then I don’t think we will have a problem today.

Is this argument comparing non-deterministic centralized software providers with a decentralized BFT network? If so, do you think this is a relevant comparison in this context?

HoytRen · 18 December 2023 06:31

Let me explain.

1. Here is a ‘Weirdly’ at the beginning. I mean upgrading the network without upgrading the binary is impractical even if it’s possible, by these 2 sentences.
2. Here I mean we don’t need to worry about 2/3 of the voting power because we only take the upgrading procedure after reaching the social consensus. if a powerful validator publicly said he will upgrade, but he doesn’t take action at the right time, and causes the halt of the chain, he must be slashed. On the other hand, the chain should be halted if the upgrading gets something wrong, and not only the new chain but the old chain should not continue too from the view of functional nodes. Then we don’t need to on-chain-check voting power when we prepare for upgrading.
3. This refers to “any migrations between height h and height h + 1 can occur”. I may misunderstand the sentence too. I think the binary already specified the migrations, there should be nothing out of social consensus.
4. Here I’m specifically talking about upgrading and managing the binary of the client, no matter if it’s centralized or decentralized, there isn’t big difference. Simply put, I believe we don’t need the compatibility of the wrong client on other versions, just ignore or slash them if they do.

By the way, what’s the goal of “avoid having to pick an upgrade height.”? I don’t feel a determined height is bad.

evan · 2 January 2024 14:29

I mean upgrading the network without upgrading the binary is impractical even if it’s possible, by these 2 sentences.

Are you referring to using a single binary to upgrade? If so, I don’t see this as impractical since Ethereum and Bitcoin do this. It certainly is more difficult, but we are already doing this work to be able to sync from scratch using a single binary.

if a powerful validator publicly said he will upgrade, but he doesn’t take action at the right time, and causes the halt of the chain, he must be slashed.

This mechanism just favors liveness by not unnecessarily halting the chain. Slashing would still occur in this example; there’s just no reason to halt the chain. Celestia has to optimize for not halting since there are so many chains built on top.

Not sure I understand point 3, do you mind rephrasing that question/point? Migrations are decided by upgrades, which are decided by social consensus.

The upgrade height must be determined somehow, but there’s just no reason to determine it manually when validators can signal when they’re ready. It would be possible to schedule it to a more convenient time, but we still want to avoid halting if not all of the validators upgrade.

HoytRen · 4 January 2024 08:47

You see, the main difference between our opinions is “the importance of liveness”. I agree that it’s important because so many things will be built on Celestia, but I disagree with how we achieve it. Even though I’m a programmer too, I strongly disagree with a technical fallback. Let me explain what I mean by “the social consensus is first”. If a failed upgrade could be tolerated, it still must be confirmed by social consensus case by case. When an upgrade is important, we must put more resources and time into achieving social consensus before we implement it; then, once a decision is made by social consensus, everyone must obey or be exiled. If exiling someone makes the chain crash, it means the chain was never safe and there should never have been so many things built on it; that’s why decentralization is important. Even if you create the logic that fallback the upgrade so that the old chain continues to run, but 30% of people or more disagree with it, then what’s the meaning of it? That only causes problems, because somebody may not realize the failure and still interact with it; new benefits then become bound to these interactions, and people never like to give them up. I prefer solving the problem as soon as possible to an unexpected fallback. Otherwise a lot of people could be coerced by powerful validators. We should redistribute the value that is held by the malicious validators so that the cost to society can be compensated.

HoytRen · 4 January 2024 08:54

In fact, I believe the failure of upgrade should never be tolerated because we achieved social consensus already. Technically, ETH putting so many resources into testnet isn’t a bad idea, or ETH should not exist anymore. This isn’t realistic.

We put so much effort into notifying the light nodes that the chain has problems. But now we allow a failed chain to continue, this has no logic.

cmwaters · 5 January 2024 02:28

I feel as if you view this proposal as an alternative to the mechanics of social consensus itself. Social consensus will always exist and take precedence. This proposal isn’t about how a community decides to upgrade; it’s about how, once they have decided, the upgrade is coordinated in a way that attempts to minimize the downtime of the network. Regardless of how this is coordinated, the community will always have the option to fork and to choose which chain they believe to be the canonical one.

Even if you create the logic that fallback the upgrade so that the old chain continues to run, but 30% of people or more disagree with it, then what’s the meaning of it?

The proposal isn’t about a fallback mechanism that ensures that the chain is still running; it’s about proactively ensuring that a sufficient number of nodes are prepared for the upgrade such that when it happens the chain can continue.

If the community disagrees with the upgrade, they merely fork out the validators that chose the upgrade and continue running their version.