> *Note*: Because *all* shards distribute payload defined in [14/WAKU2-MESSAGE](spec/14/) via [protocol buffers](https://developers.google.com/protocol-buffers/),
the pubsub topic name does not explicitly add `/proto` to indicate protocol buffer encoding.
The value is comprised of a two-byte shard cluster index in network byte order concatenated with a 128-byte wide bit vector.
The bit vector indicates which shards of the respective shard cluster the node is part of.
The right-most bit in the bit vector represents shard `0`, the left-most bit represents shard `1023`.
The representation in the ENR is inspired by [Ethereum shard ENRs](https://github.com/ethereum/consensus-specs/blob/dev/specs/altair/validator.md#sync-committee-subnet-stability)),
and [this](https://github.com/ethereum/consensus-specs/blob/dev/specs/altair/validator.md#sync-committee-subnet-stability)).
> *Note:* Automatic sharding is not yet part of this specification.
This section merely serves as an outlook.
A specification of automatic sharding will be added to this document in a future version.
Automatic sharding is a method for scaling Waku relay in the number of (smaller) content topics.
It automatically maps Waku content topics to pubsub topics.
Clients and protocols building on Waku relay only see content topics, while Waku relay internally manages the mapping.
This provides both scaling as well as removes confusion about content and pubsub topics on the consumer side.
From an app point of view, a subscription to a content topic `waku2/xxx` using automatic sharding would look like:
`subscribe("/waku2/xxx", auto=true)`
The app is oblivious to the pubsub topic layer.
(Future versions could deprecate the default pubsub topic and remove the necessity for `auto=true`.)
*The basic idea behind automatic sharding*:
Content topics are mapped using [consistent hashing](https://en.wikipedia.org/wiki/Consistent_hashing).
Like with DHTs, the hash space is split into parts,
each covered by a Pubsub topic (mesh network) that carries content topics which are mapped into the respective part of the hash space.
There are (at least) two issues that have to be solved: *Hot spots* and *Discovery* (see next subsection).
Hot spots occur (similar to DHTs), when a specific mesh network becomes responsible for (several) large multicast groups (content topics).
The opposite problem occurs when a mesh only carries multicast groups with very few participants: this might cause bad connectivity within the mesh.
Our research goal here is finding efficient ways of distribution.
We could get inspired by the DHT literature.
We also have to consider:
If a node is part of many content topics which are all spread over different shards,
the node will potentially be exposed to a lot of network traffic.
## Discovery
For the discovery of automatic shards this document specifies two methods (the second method will be detailed in a future version of this document).
The first method uses the discovery introduced above in the context of static shards.
The index range `49152 - 65535` is reserved for automatic sharding.
Each index can be seen as a hash bucket.
Consistent hashing maps content topics in one of these buckets.
The second discovery method will be a successor to the first method,
but is planned to preserve the index range allocation.
Instead of adding the data to the ENR, it will treat each array index as a capability,
which can be hierarchical, having each shard in the indexed shard cluster as a sub-capability.
When scaling to a very large number of shards, this will avoid blowing up the ENR size, and allows efficient discovery.
We currently use [33/WAKU2-DISCV5](https://rfc.vac.dev/spec/33/) for discovery,
which is based on Ethereum's [discv5](https://github.com/ethereum/devp2p/blob/master/discv5/discv5.md).
While this allows to sample nodes from a distributed set of nodes efficiently and offers good resilience,
it does not allow to efficiently discover nodes with specific capabilities within this node set.
Our [research log post](https://vac.dev/wakuv2-apd) explains this in more detail.
Adding efficient (but still preserving resilience) capability discovery to discv5 is ongoing research.
[A paper on this](https://github.com/harnen/service-discovery-paper) has been completed,
but the [Ethereum discv5 specification](https://github.com/ethereum/devp2p/blob/master/discv5/discv5-theory.md)
has yet to be updated.
When the new capability discovery is available,
this document will be updated with a specification of the second discovery method.
The transition to the second method will be seamless and fully backwards compatible because nodes can still advertise and discover shard memberships in ENRs.
# Security/Privacy Considerations
See [45/WAKU2-ADVERSARIAL-MODELS](/spec/45), especially the parts on k-anonymity.
We will add more on security considerations in future versions of this document.
## Receiver Anonymity
The strength of receiver anonymity, i.e. topic receiver unlinkablity,
depends on the number of content topics (`k`) that get mapped onto a single pubsub topic (shard).
For *named* and *static* sharding this responsibility is at the app protocol layer.
Until automatic sharding is fully specified, (smaller) Apps SHOULD use the default PubSub topic unless there is a good reason not to,
e.g. a requirement to scale to large user numbers (in a rollout phase, the default pubsub topic might still be the better option).
Using a single PubSub topic ensures a connected network, as well as some degree of metadata protection.
See [section on Anonymity/Unlinkability](/spec/10/#anonymity--unlinkability).
Using another pubsub topic might lead to
- weaker metadata protection
- connectivity problems if there are not enough nodes within the respective pubsub mesh
- store nodes might not store messages for the chosen pubsub topic
Apps that use named (not the default) or static sharding likely have to setup their own infrastructure nodes which may render the application less robust.