This post is to discuss the CIP on coordinated network upgrades which can be found here: https://github.com/celestiaorg/CIPs/blob/main/cips/cip-10.md
Thanks @cmwaters for the CIP!
Mind elaborating whether the 5/6 quorum is sufficient if observed at a certain height before the upgrade height? For example, if at height x < upgrade_height the quorum is achieved, but at a later height y<upgrade_height the quorum is not maintained (due to voting power change), does the upgrade still proceed as planned at upgrade_height?
Also, can the upgrade occur at an earlier height z (z< upgrade_height) upon receiving a MsgTryUpgrade for that height z?
I'd defer to Callum @cmwaters for final confirmation, but
does the upgrade still proceed as planned at upgrade_height?
Good question! If the voting power changes after 5/6 have signalled, and there is no longer a 5/6 quorum, then the upgrade can no longer be triggered. The logic can be found on this line.
Also, can the upgrade occur at an earlier height z (z< upgrade_height) upon receiving a MsgTryUpgrade for that height z?
What do we mean by z here? I'm unaware of any minimum upgrade height, but could definitely be missing something. AFAIU, the upgrade can occur at any height provided there is sufficient signal.
Good question! If the voting power changes after 5/6 have signalled, and there is no longer a 5/6 quorum, then the upgrade can no longer be triggered. The logic can be found on this line.
Thanks a lot, Evan, for shedding light on this.
Also, can the upgrade occur at an earlier height z (z< upgrade_height) upon receiving a MsgTryUpgrade for that height z?
What do we mean by z here? I'm unaware of any minimum upgrade height, but could definitely be missing something. AFAIU, the upgrade can occur at any height provided there is sufficient signal.
My question applies to the next version, where it appears that the upgrade height is predetermined. Will there be a signaling protocol in place for this new version? Specifically, what would occur if a MsgTryUpgrade is received at an earlier height, let's call it "z", and a quorum for the next version is confirmed, but "z" is less than the set upgrade_height? In this scenario, does the upgrade process have to wait until the chain reaches the predetermined upgrade_height?
Will there be a signaling protocol in place for this new version?
Ahh I see! There will not be an on-chain protocol in place, no. All signalling and height determination will be completely off-chain.
The set upgrade_height is specifically for v2. If a user sets that flag while the network is on v2, it will simply be ignored. In v3, the flag will be removed. It's also not possible to downgrade in any case.
The messages MsgSignalVersion and MsgTryUpgrade are only recognised in v2, not in v1, so submitting them is not possible (or, more accurately, they are ignored if submitted).
So the idea is that at v3, the upgrade height isn't fixed but the upgrade will happen immediately when there's a crank transaction, right?
Yea, correct. There will likely be some daemon tool that just monitors the chain for quorum and, at that moment, submits the crank message to upgrade the chain.
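To make the daemon idea above concrete, here is a minimal Python sketch. It is not the actual tool: `get_tally` and `submit_try_upgrade` are hypothetical callables standing in for a chain query and for broadcasting MsgTryUpgrade, and the inclusive 5/6 comparison is an assumption.

```python
import time


def crank_when_quorum(get_tally, submit_try_upgrade, poll_seconds=6.0, max_polls=1000):
    """Poll until 5/6 of the voting power has signalled, then submit the crank once.

    get_tally: hypothetical callable returning (signalled_power, total_power).
    submit_try_upgrade: hypothetical callable that broadcasts MsgTryUpgrade.
    Returns True if the crank was submitted, False if max_polls ran out first.
    """
    for _ in range(max_polls):
        signalled, total = get_tally()
        # Integer form of signalled / total >= 5/6, avoiding floating point.
        if total > 0 and 6 * signalled >= 5 * total:
            submit_try_upgrade()
            return True
        time.sleep(poll_seconds)
    return False
```

Since submitting the message is permissionless, anyone could run something like this; whoever submits the crank pays the gas for the tally.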
Got it. What's the rationale for having a crank message, compared to, e.g. automatically running the upgrade logic once 5/6 have signalled?
It's done as a gas optimisation. We have to tally the votes according to the voting power of each validator to understand whether 5/6 have signalled. Rather than doing this calculation every height (or every n heights), it's more efficient from a computation perspective if a single node does the calculation and then submits the message and pays for the gas (generally I think we should reduce usage of EndBlock because there's no one paying for that computation).
You might ask, why not track the diffs in voting power? This is difficult when you are combining that with validators changing the version they are signalling for (you basically need to keep track of each validator's voting power). You also need to know whether they are in the set or have been kicked out. So while there are some hooks available, it really comes down to the legacy implementation that makes this not viable. Plus, given that voting power changes almost every height, it might be more computationally intense to track voting power diffs than to do a one-off tally.
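The one-off tally described above can be sketched roughly as follows. This is an illustration in Python, not the celestia-app implementation: the dictionary shapes and function names are hypothetical, and treating the 5/6 threshold as inclusive is an assumption.

```python
from fractions import Fraction

THRESHOLD = Fraction(5, 6)  # quorum fraction discussed in this thread


def tally(voting_power, signals, version):
    """Return (signalled_power, total_power) for `version`.

    voting_power: validator -> power, for the current active set only.
    signals: validator -> version that validator has signalled.
    Signals from validators no longer in the active set are ignored.
    """
    total = sum(voting_power.values())
    signalled = sum(p for v, p in voting_power.items() if signals.get(v) == version)
    return signalled, total


def has_quorum(voting_power, signals, version):
    """True if the power signalling `version` is at least 5/6 of the total (assumed inclusive)."""
    signalled, total = tally(voting_power, signals, version)
    return total > 0 and Fraction(signalled, total) >= THRESHOLD
```

Because the tally reads the validator set at the moment it runs, running it again after voting power has shifted can give a different answer, which is why a quorum seen at an earlier height does not guarantee that a later crank succeeds.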
I don't like this optimization. I believe some costs for these important events are necessary, and I believe the chain should be maintained by its client binary only, without any centralized third party, when it's possible and not ridiculous. The nodes in consensus always know who is in the active set, so it's not necessary to have an external tracker to lower the cost.
In my opinion, the upgrade process should be simple:
An operator upgrades the binary of his node, then the node whispers a message when it starts to say it is ready to upgrade the network at height X. At the same time it gathers the whispers from the network to see if there is enough of a quorum. When the node gets enough signatures before X (I also don't understand why the upgrade process is relative to the voting power), it schedules the network upgrade. When the height reaches X, the node checks if the condition is still satisfied before taking the final action.
@evan, I believe the logic should be improved. For example, currently we have 100 validators, so 5/6 is about 85. When there are 85 signatures, the node should schedule the upgrade, but the action should not be canceled when the number of active signed validators drops to 84. We must pick a lower value, 80 for example, to cancel the action. There must be a gap or your system will never be stable.
@cmwaters, I also feel the text of the CIP needs to be improved. For example, "Once a quorum of 5/6 has signalled the same version, the network will migrate to that version." sounds misleading (migrate immediately?).
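The "gap" suggested in this post is a form of hysteresis. A minimal Python sketch of that suggestion follows; it illustrates the commenter's proposal, not how the CIP works. The class name is hypothetical, the 85 and 80 values are taken from the example in the post, and reading "80 to cancel" as "cancel when the count falls below 80" is an assumption.

```python
class UpgradeScheduler:
    """Schedules at a high threshold and cancels only below a lower one."""

    def __init__(self, schedule_at=85, cancel_below=80):
        if cancel_below > schedule_at:
            raise ValueError("cancel_below must not exceed schedule_at")
        self.schedule_at = schedule_at
        self.cancel_below = cancel_below
        self.scheduled = False

    def observe(self, signalling):
        """Update state from the current count of signalling validators."""
        if not self.scheduled:
            if signalling >= self.schedule_at:
                self.scheduled = True
        elif signalling < self.cancel_below:
            # Only a drop past the lower bound cancels, so 84 does not flap.
            self.scheduled = False
        return self.scheduled
```

With these values, a sequence of counts 84, 85, 84, 80, 79 leaves the upgrade unscheduled, scheduled, scheduled, scheduled, then cancelled.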
Submission of this message is permissionless. There does not need to be reliance on a third party. Anyone can do it, including the validators if they wish.
The problem with having the validators read the tally and if it goes over the threshold proposing a block with the next version, is that the upgrade really needs to be agreed upon the height before so any migrations between height h and height h + 1 can occur. Also any p2p based system makes it easy to equivocate. Having it on-chain adds visibility and thus accountability to the actions of the validators.
There must be a gap or your system will never be stable.
Say there are 5/6 signalling and the network upgrades, if the number then were to fall below the quorum, the chain wouldn't downgrade. In fact, at the moment, it can't downgrade. And only 2/3 are required to commit the following block so there is already a tolerance of 1/6.
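The 1/6 tolerance mentioned above is just the difference between the two thresholds. A small Python sketch of the arithmetic, with hypothetical names; as far as I know CometBFT requires strictly more than 2/3 of voting power to commit, so the usable margin is marginally under 1/6:

```python
from fractions import Fraction

UPGRADE_QUORUM = Fraction(5, 6)   # power that must signal before upgrading
COMMIT_THRESHOLD = Fraction(2, 3)  # power that must be exceeded to commit a block

# 5/6 - 2/3 = 5/6 - 4/6 = 1/6
margin = UPGRADE_QUORUM - COMMIT_THRESHOLD


def can_commit(upgraded_fraction):
    """True if the upgraded validators alone can keep committing blocks."""
    return upgraded_fraction > COMMIT_THRESHOLD
```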
I also feel the text of the CIP needs to be improved
Happy to adjust it. Is it just that sentence that you felt was misleading?
I read the CIP again, and I still have questions.
I agree that visualizing the progress is important, but most users will rely on an explorer, and anyone could query their light node about the state if they don't trust the explorer (we just need an API). On the other hand, whispering should have signatures too, so the message may not be necessary. If you believe an on-chain message is necessary, I will not argue that, as it's not expensive. But I disagree with how these messages work. If the validators could signal the current version and the result is determined by voting power, this in fact becomes a voting governance, and I believe this should be done by social consensus.
Weirdly, we could upgrade the network without upgrading the binary of the client, so a special message to notify the proposal of the height of the upgrade seems unnecessary. I'm thinking about a proposal to put this key information on-chain too, but I don't think it's urgent because social consensus works well so far. It is almost like documentation for me, and there should be a unified form for all, not one for upgrading the network, one for activating a CIP, etc.
I believe the upgrade is some sort of fork, so voting power means nothing here. As long as enough validators agree with the upgrade by social consensus, the action should be taken. If we can't reach 2/3 of the voting power in the new fork, it means the consensus failed, then why do we allow a failed chain to continue? Whoever causes this situation must be slashed by social consensus so that the chain keeps going.
We will not upgrade the network frequently, and the binary upgrade is always necessary, then the skip version problem should not exist practically. This is generally a QA problem of version management. Traditional software suppliers like SAP supported their customers with tens of different versions without any problem 10 years ago, then I don't think we will have a problem today.
Let's check the text issues later and focus on the logic for now.
we could upgrade the network without upgrading the binary of the client
... and the binary upgrade is always necessary ...
I'm confused by these statements. The software must change in order for the upgrade to occur, no? Are you saying that we have to change the binary or that we don't have to? Perhaps something else?
If the validators could signal the current version and the result is determined by voting power, this in fact becomes a voting governance, and I believe this should be done by social consensus.
If we can't reach 2/3 of the voting power in the new fork, it means the consensus failed, then why do we allow a failed chain to continue? Whoever causes this situation must be slashed by social consensus so that the chain keeps going.
The point of the signalling mechanism is to avoid accidentally getting into situations where the chain halts, to stop a single validator from being able to easily trigger that situation, and to avoid having to pick an upgrade height. If the validators don't follow social consensus, then they will still be slashed socially. The most important thing is the social contract, which is still very much to abide by social consensus and not token voting. Since the version is committed to and checked upon the verification of each header, there is no way for validators to upgrade without full nodes and light clients also seeing the version change. This means that if light clients see a version they don't expect or agree with, they will halt, and the chain, from their perspective, does not continue. That would be the point where social slashing would occur.
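The halting behaviour described above can be pictured with a small Python sketch. This is only an illustration under assumed names: the `(height, app_version)` header shape, `check_headers`, and `UnexpectedAppVersion` are hypothetical and do not reflect the real light client code.

```python
class UnexpectedAppVersion(Exception):
    """Raised when a header commits to a version this node does not accept."""


def check_headers(headers, accepted_versions):
    """Follow headers until one carries a version outside accepted_versions.

    headers: iterable of (height, app_version) pairs, in height order.
    Returns the height of the last accepted header; raises on the first
    header whose version is not accepted, which models the node halting.
    """
    last_height = None
    for height, version in headers:
        if version not in accepted_versions:
            raise UnexpectedAppVersion(
                f"height {height}: version {version} not accepted"
            )
        last_height = height
    return last_height
```

A node that accepts versions 1 and 2 follows the chain through an upgrade to version 2, while a node that only accepts version 1 stops at the first version 2 header.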
We will not upgrade the network frequently, and the binary upgrade is always necessary, then the skip version problem should not exist practically.
I might be misunderstanding a portion of this argument as I'm not sure what is meant by "skip version problem". Do you mean that we don't need to increment the version or commit to it in the header?
This is generally a QA problem of version management. Traditional software suppliers like SAP supported their customers with tens of different versions without any problem 10 years ago, then I don't think we will have a problem today.
Is this argument comparing non-deterministic centralized software providers with a decentralized BFT network? If so, do you think this is a relevant comparison in this context?
Let me explain.
1. There is a "Weirdly" at the beginning. I mean upgrading the network without upgrading the binary is impractical even if it's possible, by these 2 sentences.
2. Here I mean we don't need to worry about 2/3 of the voting power because we only start the upgrade procedure after reaching social consensus. If a powerful validator publicly said he will upgrade, but he doesn't take action at the right time, and causes the halt of the chain, he must be slashed. On the other hand, the chain should be halted if the upgrade goes wrong, and from the view of functional nodes neither the new chain nor the old chain should continue. Then we don't need to check voting power on-chain when we prepare for upgrading.
3. This refers to "any migrations between height h and height h + 1 can occur". I may misunderstand the sentence too. I think the binary already specifies the migrations, so there should be nothing outside of social consensus.
4. Here I'm specifically talking about upgrading and managing the binary of the client. Whether it's centralized or decentralized, there isn't a big difference. Simply put, I believe we don't need compatibility with the wrong client on other versions; just ignore or slash them if they do.
By the way, what's the goal of "avoid having to pick an upgrade height"? I don't feel a determined height is bad.
I mean upgrading the network without upgrading the binary is impractical even if it's possible, by these 2 sentences.
Are you referring to using a single binary to upgrade? If so, I don't see this as impractical since Ethereum and Bitcoin do this. It certainly is more difficult, but we are already doing this work to be able to sync from scratch using a single binary.
If a powerful validator publicly said he will upgrade, but he doesn't take action at the right time, and causes the halt of the chain, he must be slashed.
This mechanism just favors liveness by not unnecessarily halting the chain. Slashing would still occur in this example, there's just no reason to halt the chain. Celestia has to optimize for not halting since there are so many chains built on top.
Not sure I understand point 3, do you mind rephrasing that question/point? Migrations are decided by upgrades, which are decided by social consensus.
The upgrade height must be determined some way, but there's just no reason to determine it manually when validators can signal when they're ready. It would be possible to schedule it for a more convenient time, but we still want to avoid halting if not all of the validators upgrade.
You see, the main difference between our opinions is "the importance of liveness". I agree that it's important because so many things will build on Celestia, but I disagree with how we achieve it. Even though I'm a programmer too, I strongly disagree with a technical fallback. Let me explain what I mean by "social consensus comes first". If a failed upgrade could be tolerated, it still must be confirmed by social consensus case by case. When an upgrade is important, we must put more resources and time into achieving social consensus before we implement it. Then, once a decision is made by social consensus, everyone must obey or be exiled. If exiling someone makes the chain crash, it means the chain was never safe and there should never have been so many things built on it; that's why decentralization is important. Even if you create the logic that falls back from the upgrade so that the old chain continues to run, but 30% of people or more disagree with it, then what's the meaning of it? That only causes problems, because somebody may not realize the failure and still interact with it, then new benefits become bound to these interactions, and people never like to give those up. I prefer solving the problem as soon as possible over an unexpected fallback. Otherwise a lot of people could be coerced by powerful validators. We should redistribute the value that is held by the malicious validators so that the cost to society can be compensated.
In fact, I believe the failure of an upgrade should never be tolerated because we achieved social consensus already. Technically, ETH putting so many resources into testnet isn't a bad idea, or ETH should not exist anymore. This isn't realistic.
We put so much effort into notifying the light nodes that the chain has problems. But now we allow a failed chain to continue, this has no logic.
I feel as if you view this proposal as an alternative to the mechanics of social consensus itself. Social consensus will always exist and take precedence. This proposal isn't about how a community decides to upgrade. It's about how, once they have decided, the upgrade is coordinated in a way that attempts to minimize the downtime of the network. Regardless of how this is coordinated, the community will always have the option to fork and to choose which chain they believe to be the canonical one.
Even if you create the logic that falls back from the upgrade so that the old chain continues to run, but 30% of people or more disagree with it, then what's the meaning of it?
The proposal isn't about a fallback mechanism that ensures that the chain is still running, it's about proactively ensuring that a sufficient number of nodes are prepared for the upgrade such that when it happens the chain can continue.
If the community disagrees with the upgrade, they merely fork out the validators that chose the upgrade and continue running their version.