CIP: Standardize data expiry time for pruned nodes

This is a discussion thread for the following CIP: https://github.com/celestiaorg/CIPs/blob/main/cips/cip-4.md

Abstract: This CIP standardizes the default expiry time of historical blocks for pruned (non-archival) nodes to 30 days.

6 Likes

@rene @Wondertan What should the libp2p topics for pruned vs non-pruned nodes be?

Continuing the discussion here from the GitHub PR:

@rene:

The pruned node topic will not be an additional topic but rather the current topic that we already have, which is full. The additional topic will be for archival nodes (archival).

We should call storage nodes archival to better indicate that they retain + serve historical blocks.

@musalbas:

IMO it should be the other way round: the existing topic should be for archival nodes, otherwise you will break backwards compatibility, because you will clutter the existing topic with non-archival nodes, which will not be backwards compatible with older nodes that discover peers on the existing topic and assume those peers have all the historical blocks.

We could keep full for archival and introduce an additional topic such as pruned.

1 Like

@musalbas

The reason I proposed full as the default and archival as the new topic is that all archival nodes are also full nodes. “Pruned” will be the default, but both archival and pruned nodes will serve data within the sampling window.

When we introduce pruning as an experimental feature, light nodes will already be sampling only within the sampling window, meaning that when they actively discover full nodes, it won’t matter whether those nodes are pruned or archival, as both will serve the data light nodes want.

the existing topic should be for archival nodes, otherwise you will break backwards compatibility, because you will clutter the existing topic with non-archival nodes

The existing topic exists as a mechanism for lights + fulls to discover fulls for the purpose of being able to sync recent data (within the sampling window), as that is what the majority of the network cares about. That behaviour does not change whether or not the full topic is cluttered with nodes running old software that retains all historical blocks. The only new behaviour being introduced is for nodes that actually want to do an archival sync.

The only issue here is a full node running old software that wants to sync off p2p (where it expects to sync from genesis → network head) coming up online when the majority of the network is running pruned versions.

On the other hand, if we left the existing topic for archival nodes and introduced a pruned topic, we’d have to change the topic that lights and pruned fulls optimise discovery for to pruned, as it’s likely that once pruning is enabled as the default, most full nodes on the network will be pruned nodes.
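
To make the topic mechanics concrete, here is a minimal go-libp2p sketch of the approach described above: every full node, pruned or not, keeps advertising under the existing full topic, and only archival nodes additionally advertise a new archival topic that historical-sync clients query. The topic strings, the isArchival flag, and the overall wiring are illustrative assumptions, not celestia-node’s actual discovery code.

```go
package main

import (
	"context"
	"fmt"

	"github.com/libp2p/go-libp2p"
	dht "github.com/libp2p/go-libp2p-kad-dht"
	"github.com/libp2p/go-libp2p/p2p/discovery/routing"
	"github.com/libp2p/go-libp2p/p2p/discovery/util"
)

const (
	fullTopic     = "full"     // existing topic: nodes serving the sampling window
	archivalTopic = "archival" // new topic: nodes retaining all historical blocks
)

func main() {
	ctx := context.Background()

	h, err := libp2p.New()
	if err != nil {
		panic(err)
	}
	kad, err := dht.New(ctx, h)
	if err != nil {
		panic(err)
	}
	disc := routing.NewRoutingDiscovery(kad)

	// Every full node, pruned or not, keeps advertising on the existing topic,
	// so discovery by light nodes is unchanged.
	util.Advertise(ctx, disc, fullTopic)

	// Only nodes that retain all historical blocks also advertise the new topic.
	isArchival := true // placeholder for the node's configuration
	if isArchival {
		util.Advertise(ctx, disc, archivalTopic)
	}

	// A node that wants to do an archival sync looks up peers on the new topic.
	peers, err := disc.FindPeers(ctx, archivalTopic)
	if err != nil {
		panic(err)
	}
	for p := range peers {
		fmt.Println("archival peer:", p.ID)
	}
}
```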

3 Likes

I want to make an addition here:

The existing topic exists as a mechanism for lights + fulls to discover fulls for the purpose of being able to sync recent data (within the sampling window), as that is what the majority of the network cares about.

The way I see it, the existing topic is there to improve network connectivity, so that light nodes aren’t only connected to other light nodes, for example. It doesn’t actually matter whether the nodes advertising this topic are pruned or not, as long as they can serve recent data.

The new topic is not there to ensure such connectivity; it is only there for the case where your node wants to retrieve historical data. Also, there will not be nearly as many archival nodes as there are full nodes, so “clutter” doesn’t really apply here. There might (and probably will) be some overlap between the two sets, but that doesn’t matter.

1 Like

The reason I proposed full as the default and archival as the new topic is that all archival nodes are also full nodes. “Pruned” will be the default, but both archival and pruned nodes will serve data within the sampling window.

Archival nodes cover a wider range of heights. If there are only ‘full’ and ‘archival’ topics, both types of nodes will be used equally for recent blocks. This would mean that archival nodes get more incoming requests / load than pruned nodes: they need to serve recent blocks like pruned nodes do, plus the old data that pruned nodes are not serving.

Perhaps it is worth introducing a ‘pruned’ topic. It would allow requests for recent data to be prioritized toward pruned nodes, easing the load on archival nodes.
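
To illustrate what that prioritization could look like, here is a rough Go sketch. The Topic and Peer types and the selectPeers helper are hypothetical, invented purely for illustration; only the idea of routing recent requests away from archival peers comes from the discussion above.

```go
package main

import "fmt"

// Topic is the discovery topic a peer was found under. The values mirror the
// topics being discussed in this thread; they are illustrative only.
type Topic string

const (
	TopicFull     Topic = "full"
	TopicPruned   Topic = "pruned"
	TopicArchival Topic = "archival"
)

// Peer is a hypothetical record of a discovered peer and the topic it
// advertised under.
type Peer struct {
	ID    string
	Topic Topic
}

// selectPeers prefers non-archival peers for data inside the sampling window,
// so archival nodes are only asked for what pruned nodes cannot serve.
func selectPeers(peers []Peer, withinWindow bool) []Peer {
	var selected []Peer
	for _, p := range peers {
		if withinWindow {
			// Recent data: any full node can serve it, so spare the archival nodes.
			if p.Topic != TopicArchival {
				selected = append(selected, p)
			}
		} else {
			// Historical data: only archival peers still have it.
			if p.Topic == TopicArchival {
				selected = append(selected, p)
			}
		}
	}
	if withinWindow && len(selected) == 0 {
		// No pruned peer available; archival peers can still serve recent data.
		return peers
	}
	return selected
}

func main() {
	peers := []Peer{{ID: "a", Topic: TopicPruned}, {ID: "b", Topic: TopicArchival}}
	fmt.Println(selectPeers(peers, true))  // recent height: pruned peer only
	fmt.Println(selectPeers(peers, false)) // historical height: archival peer only
}
```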

5 Likes

@walldiss
Eventually, the default will be for nodes to run with pruned mode on, meaning that the majority of full nodes on the network will be pruned rather than archival. I don’t think it is worth breaking the full topic now to accommodate the potential issue that, for a brief time (while pruning is still an optional feature rather than the default), archival nodes will be requested for recent blocks.

If the infrastructure costs are too high, the archival nodes can just stop advertising on the full topic.

2 Likes

Two questions:

  1. Is there any issue with using “30 days” instead of a more precise number of seconds? Converting days into seconds isn’t always trivial. On the other hand, if 30 days is chosen specifically to allow a huge leeway from the 21-day unbonding period, then exact-to-the-second precision may not be important.

  2. Regarding this requirement from the CIP:

Non-pruned nodes MAY advertise themselves under a new archival tag, in which case the nodes MUST store and distribute data in all blocks.

How does the archival tag here interplay with partial nodes? For example, I may want to run a partial archival node for state-changing transactions that prunes blob data.

2 Likes

Is there any issue with using “30 days” instead of a more precise number of seconds? Converting days into seconds isn’t always trivial. On the other hand, if 30 days is chosen specifically to allow a huge leeway from the 21-day unbonding period, then exact-to-the-second precision may not be important.

Is there any other definition of 30 days than 2592000 seconds? Should we make it explicit?

There could also be potential issues where a light node samples right on the boundary of the sampling window (when starting up), but the full node no longer has the data by the time it receives the request, or their clocks are slightly out of sync. So maybe it’s better to keep it imprecise, because what’s important here is that it’s greater than the unbonding period and that the general retrievability expectation is 30 days, rather than the exact number of seconds?
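
For reference, a small Go sketch of the arithmetic and of the boundary case above. The SamplingWindow constant and the isWithinWindow helper are names assumed for illustration, not identifiers from celestia-node.

```go
package main

import (
	"fmt"
	"time"
)

// Thirty days expressed as a duration: 30 * 24 * 60 * 60 = 2,592,000 seconds.
// Calendar details (DST, time zones, leap years) deliberately play no role.
const SamplingWindow = 30 * 24 * time.Hour

// isWithinWindow reports whether a block produced at blockTime still falls
// inside the sampling window for a node whose clock reads now. Nodes with
// slightly skewed clocks can disagree right at the boundary, which is the
// edge case discussed above.
func isWithinWindow(blockTime, now time.Time) bool {
	return now.Sub(blockTime) <= SamplingWindow
}

func main() {
	fmt.Println(int64(SamplingWindow / time.Second)) // 2592000

	now := time.Now()
	boundary := now.Add(-SamplingWindow)
	fmt.Println(isWithinWindow(boundary, now))                  // true: exactly on the edge
	fmt.Println(isWithinWindow(boundary, now.Add(time.Minute))) // false: a slightly later clock already treats it as expired
}
```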

How does the archival tag here interplay with partial nodes? For example, I may want to run a partial archival node for state-changing transactions that prunes blob data.

Since partial namespace nodes aren’t currently supported, I don’t think that’s relevant or in scope.

2 Likes

Hey, currently nodes require specifying the trusted genesis block hash to start syncing from. Will this CIP introduce some sort of weak subjectivity checkpoints that one would need to explicitly trust in order to spin up a node?

1 Like

Hi @zvolin, this CIP will not impact the current behaviour, which is to allow a user either to specify a weak subjectivity checkpoint on start-up or to request one from the default trusted source (hardcoded bootstrappers).
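
As a hedged sketch of that start-up decision in Go (the Config struct, fetchHashFromBootstrappers, and resolveTrustedHash are hypothetical names that only illustrate the flow, not celestia-node’s actual API):

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Config is a hypothetical slice of node configuration relevant here.
type Config struct {
	// TrustedHash is a user-supplied weak subjectivity checkpoint (header hash).
	// Empty means "ask the default trusted source instead".
	TrustedHash string
}

// fetchHashFromBootstrappers stands in for requesting a recent trusted header
// hash from the hardcoded bootstrappers; the real mechanism lives in celestia-node.
func fetchHashFromBootstrappers(ctx context.Context) (string, error) {
	return "", errors.New("not implemented in this sketch")
}

// resolveTrustedHash mirrors the behaviour described above: use the hash the
// user provided on start-up, otherwise fall back to the default trusted source.
func resolveTrustedHash(ctx context.Context, cfg Config) (string, error) {
	if cfg.TrustedHash != "" {
		return cfg.TrustedHash, nil
	}
	return fetchHashFromBootstrappers(ctx)
}

func main() {
	hash, err := resolveTrustedHash(context.Background(), Config{TrustedHash: "ABCD1234..."})
	fmt.Println(hash, err)
}
```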

1 Like

This should not be an issue if we introduce a slight buffer for the full node pruning window such that it is slightly larger than the network’s sampling window.
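
A minimal sketch of what that buffer could look like; the one-hour PruningBuffer is an arbitrary placeholder, not a proposed value.

```go
package main

import (
	"fmt"
	"time"
)

const (
	// SamplingWindow is the 30-day availability window that light nodes sample within.
	SamplingWindow = 30 * 24 * time.Hour

	// PruningBuffer is extra retention on top of the sampling window so that
	// requests landing right at the boundary, or from slightly skewed clocks,
	// can still be served. One hour is a placeholder, not a proposed value.
	PruningBuffer = time.Hour

	// PruningWindow is how long a pruned full node actually keeps data before
	// deleting it: slightly larger than the sampling window.
	PruningWindow = SamplingWindow + PruningBuffer
)

func main() {
	fmt.Println(PruningWindow) // 721h0m0s
}
```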

1 Like

Is there any other definition of 30 days than 2592000 seconds? Should we make it explicit?

Hmm, I guess not. I was more thinking of things like daylight saving time, time zones, leap years, etc., which aren’t important if the goal is simply to have it be loosely larger than the unbonding period.

But in that case the rationale should really be documented in the CIP, IMO; otherwise future readers won’t know it and may mess up the parameters.

1 Like