The internet is becoming more and more centralized. As companies and individuals increasingly rely on centralized cloud providers for storage, critical concerns about privacy, censorship, and user control, as well as about the concentration of economic power in the hands of a few entities, become more pronounced.
While there have been several attempts at providing alternatives, modern decentralized storage networks (DSNs) often fall short on basic aspects like having strong durability guarantees, being efficient to operate, or providing scalable proofs of storage. This in turn leads to solutions that are either: _i)_ not useful, as they can lose data; _ii)_ not friendly to decentralization, as they require specialized or expensive hardware, or; _iii)_ economically unfeasible, as they burden providers with too many costs beyond those of the storage hardware itself.
In this paper, we introduce Codex, a novel Erasure Coded Decentralized Storage Network that attempts to tackle these issues. Codex leverages erasure coding as part of both redundancy and storage proofs, coupling it with zero-knowledge proofs and lazy repair to achieve tunable durability guarantees while remaining modest in its hardware and energy requirements. Central to Codex is the concept of the Decentralized Durability Engine (DDE), a framework we formalize to systematically address data redundancy, remote auditing, repair, incentives, and data dispersal in decentralized storage systems.
We describe the architecture and mechanisms of Codex, including its marketplace and proof systems, and provide a preliminary reliability analysis using a Continuous-Time Markov Chain (CTMC) model to evaluate durability guarantees. Codex represents a step toward creating a decentralized, resilient, and economically viable storage layer critical for the broader decentralized ecosystem.
## 1. Introduction
Data production has been growing at an astounding pace, with significant implications. Data is a critical asset for businesses, driving decision-making, strategic planning, and innovation. Individuals increasingly intertwine their physical existence with the digital world, meticulously documenting every aspect of their lives: taking pictures and videos, sharing their views and perspectives on current events, and using digital means for communication and artistic expression. Digital personas have become as important as their physical counterparts, and this tendency is only increasing.
Yet, the current trend towards centralization on the web has led to a situation where users have little to no control over their personal data and how it is used. Large corporations collect, analyze, and monetize user data, often without consent or transparency. This lack of privacy leaves individuals vulnerable to targeted advertising, profiling, surveillance, and potential misuse of their personal information.
Moreover, the concentration of data and power in the hands of a few centralized entities creates a significant risk of censorship: platforms can unilaterally decide to remove, modify, or suppress content that they deem undesirable, effectively limiting users’ freedom of expression and access to information. This power imbalance undermines the open and democratic nature of the internet, creating echo chambers and curtailing the free flow of ideas.
Another undesirable aspect of centralization is economic: as big-tech dominance in this space evolves into an oligopoly, revenues flow into the hands of a select few, while the barrier to entry grows ever higher.
To address these issues, there is a growing need for decentralized technologies. Decentralized technologies enable users to: _i)_ own and control their data by providing secure and transparent mechanisms for data storage and sharing, and _ii)_ participate in the storage economy as providers, allowing individuals and organizations to take their share of revenues. Users can choose to selectively share their data with trusted parties, retain the ability to revoke access when necessary, and monetize their data and their hardware if they so desire. This paradigm shift towards user-centric data and infrastructure ownership has the potential to create a more equitable and transparent digital ecosystem.
Despite their potential benefits, however, the lack of efficient and reliable decentralized storage leaves a key gap that needs to be addressed before any notion of a decentralized technology stack can be seriously contemplated.
In response to these challenges, we introduce Codex: a novel Erasure Coded Decentralized Storage Network which relies on erasure coding for redundancy and efficient proofs of storage. This method provides unparalleled reliability and allows for the storage of large datasets, larger than the capacity of any single node in the network, in a durable and secure fashion. Our compact and efficient proofs of storage can detect and prevent catastrophic data loss with great accuracy while incurring relatively modest hardware and electricity requirements -- two preconditions for achieving true decentralization. In addition, we introduce and formalize in this paper the notion of durability in decentralized storage networks through a new concept we call the _Decentralized Durability Engine_ (DDE).
The remainder of this paper is organized as follows. First, we discuss the context in which Codex is built (Sec. 2) by expanding on the issues of centralized cloud storage and providing background on previous attempts at decentralized alternatives -- namely, P2P networks, blockchains, and DSNs. Then, we introduce the conceptual framework that underpins Codex -- the Decentralized Durability Engine (DDE) -- in Sec. 3, followed by a more detailed description of the mechanisms behind Codex and how it materializes as a DDE in Sec. 4. Sec. 5 then presents a preliminary reliability analysis, which places Codex's storage parameters alongside more formal durability guarantees. Finally, Sec. 6 presents conclusions and ongoing work.
## 2. Background and Context
Codex aims to be a useful and decentralized alternative to centralized storage. In this section, we discuss the context in which this need arises, as well as why past and current approaches to building and reasoning about decentralized storage have been incomplete. This sets the stage for our introduction of the Decentralized Durability Engine -- our approach to reasoning about decentralized storage -- in Sec. 3.
### 2.1. Centralized Cloud Storage
Over the past two decades, centralized cloud storage has become the _de facto_ approach for storage services on the internet for individuals and companies alike. Indeed, recent research places the percentage of businesses that rely on at least one cloud provider at $94\%$[^zippia_cloud_report], while most modern smartphones back up their contents to a cloud storage provider by default.
The appeal is clear: scalable, easy-to-use elastic storage and networking coupled with a flexible pay-as-you-go model and a strong focus on durability[^s3_reinvent_19] translating to dependable infrastructure that is available immediately and at the exact scale required.
Centralization, however, carries a long list of downsides, most of them stemming from having a single actor in control of the whole system. This effectively puts users at the mercy of the controlling actor's commercial interests, which may, and often will, not coincide with users' interests in how their data gets used, as well as of that actor's ability to stay afloat in the face of natural, political, or economic adversity. Government intervention and censorship are also important sources of concern[^liu_19]. Larger organizations are acutely aware of these risks, with $98\%$ of cloud-using businesses adopting multi-cloud environments to mitigate them[^multicloud].
The final downside is economic: since very few companies can currently provide such services at the scale and quality required, billions in customer spending get funneled into the pockets of a handful of players. Oligopolies such as these can derail into an uncompetitive market that finds its equilibrium at a price point which is not necessarily in the best interest of end-users[^feng_14].
### 2.2. Decentralized Alternatives: Past and Present
Given the downsides of centralized cloud storage, it is natural to wonder whether there could be alternatives, and indeed these have been extensively researched since the early 2000s. We will not attempt to cover that entire space here, focusing instead on what we consider to be the three main technological breakthroughs in decentralized systems over the past two decades, and on why they have failed to make meaningful inroads thus far: P2P networks, blockchains, and Decentralized Storage Networks (DSNs).
**P2P Networks.** P2P networks have been around for over two decades. Their premise is that users can run client software on their own machines and form a self-organizing network that enables sharing of resources like bandwidth, compute, and storage to provide higher-level services like search or decentralized file sharing without the need for a centralized controlling actor.
Early networks like BitTorrent[^cohen_01], however, only had rudimentary incentives based on a form of barter economy in which nodes providing blocks to other nodes would be rewarded with access to more blocks. This provides some basic incentive for nodes to exchange the data they hold, but whether a node decides to hold a given piece of data is contingent on whether the node operator was interested in that data to begin with; i.e., a node operator will likely not download a movie they have no interest in watching.
Files which are not popular therefore tend to disappear from the network, as no one is interested in them and there is no way to incentivize nodes to keep them. This lack of even basic durability guarantees means that BitTorrent and, in fact, most early P2P file-sharing networks are suitable at best as distribution networks, not as storage networks, since data can, and probably eventually will, be lost. Even more recent attempts at decentralized file sharing like IPFS[^ipfs_website] suffer from similar shortcomings: by default, IPFS offers no durability guarantees; i.e., there is no way to punish a pinning service if it fails to keep data around.
**Blockchains.** Blockchains were introduced as part of Bitcoin in 2008[^nakamoto_08], with the next major player, Ethereum[^buterin_13], being proposed in 2013 and going live in 2015. A blockchain consists of a series of blocks, each containing a list of transactions. These blocks are linked together in chronological order through cryptographic hashes: each block contains a hash of the previous block, which secures the chain against tampering. This structure ensures that once a block is added to the blockchain, the information it contains cannot be altered without redoing all subsequent blocks, making the chain secure against fraud and revisions. For all practical purposes, once a block gets added, it can no longer be removed.
This permanence, combined with the fully replicated nature of blockchains, means they provide very strong durability and availability guarantees, and this has been recognized from very early on. The full-replication model of blockchains, however, also turns out to be what makes them impractical for data storage: at the time of this writing, storing as little as a gigabyte of data on a chain like Ethereum remains prohibitively expensive[^kostamis_24].
Blockchains nevertheless represent a game-changing addition to decentralized systems in that they allow system designers to build much stronger and more complex incentive mechanisms based on monetary economies, and to implement key mechanisms like cryptoeconomic security[^chaudhry_24] through slashing, which were simply not possible before.
**Decentralized Storage Networks (DSNs).** Decentralized Storage Networks (DSNs) represent a natural stepping stone in decentralized storage: by combining traditional P2P technology with the strong incentive mechanisms afforded by modern blockchains and new cryptographic primitives, they provide a much more credible take on decentralized storage.
Like early P2P networks, DSNs consolidate storage capacity from various independent providers and orchestrate data storage and retrieval services for clients. Unlike early P2P networks, however, DSNs employ the much stronger mechanisms afforded by blockchains to incentivize correct operation. They typically employ remote auditing techniques like zero-knowledge proofs to hold participants accountable, coupled with staking/slashing mechanisms which inflict monetary losses on misbehaving participants as they are caught.
In their seminal paper[^protocol_17], the Filecoin team characterizes a DSN as a tuple $\Pi = \left(\text{Put}, \text{Get}, \text{Manage}\right)$, where:
* $\text{Put(data)} \rightarrow \text{key}$: Clients execute the Put protocol to store data under a unique identifier key.
* $\text{Get(key)} \rightarrow \text{data}$: Clients execute the Get protocol to retrieve data that is currently stored using key.
* $\text{Manage()}$: The network of participants coordinates via the Manage protocol to control the available storage, audit the service offered by providers, and repair possible faults. The Manage protocol is run by storage providers, often in conjunction with clients or a network of auditors.
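For illustration only, the sketch below renders this tuple as a Python interface; the class and method signatures are our own rendering of the definition above and do not correspond to any actual Filecoin code.

```python
from abc import ABC, abstractmethod


class DSN(ABC):
    """Illustrative rendering of the (Put, Get, Manage) tuple; not actual Filecoin code."""

    @abstractmethod
    def put(self, data: bytes) -> str:
        """Store `data` and return the unique key under which it is stored."""

    @abstractmethod
    def get(self, key: str) -> bytes:
        """Retrieve the data currently stored under `key`."""

    @abstractmethod
    def manage(self) -> None:
        """Coordinate participants: track available storage, audit providers, repair faults."""
```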
While useful, we argue this definition is incomplete as it pushes a number of key elements onto an unspecified black box protocol/primitive named $\text{Manage}()$. These include:
* incentive and slashing mechanisms;
* remote auditing and repair protocols;
* strategies for data redundancy and dispersal.
Such elements are of particular importance as one attempts to reason about what would be required to construct a DSN that provides actual utility. As we set out to design Codex and asked ourselves that question, we found that the key to useful DSNs is in _durability_; i.e., a storage system is only useful if it can provide durability guarantees that can be reasoned about.
In the next section, we explore a construct we name the Decentralized Durability Engine which, we argue, leads to a more principled approach to designing storage systems that provide utility.
## 3. Decentralized Durability Engines (DDE)
A Decentralized Durability Engine is a tuple $\Gamma = (R, A, P, I, D)$ where:
* $R$ is a set of redundancy mechanisms, such as erasure coding and replication, that ensure data availability and fault tolerance.
* $A$ is a set of remote auditing protocols that verify the integrity and availability of stored data.
* $P$ is a set of repair mechanisms that maintain the desired level of redundancy and data integrity by detecting and correcting data corruption and loss.
* $I$ is a set of incentive mechanisms that encourage nodes to behave honestly and reliably by rewarding good behavior and penalizing malicious or negligent actions.
* $D$ is a set of data dispersal algorithms that strategically distribute data fragments across multiple nodes to minimize the risk of data loss due to localized failures and to improve data availability and accessibility.
We argue that when designing a storage system that can keep data around, none of these elements is optional. Data needs to be redundant ($R$), there needs to be a way to detect failures and misbehavior ($A$), we must be able to repair data so it is not lost to accumulated failures ($P$), misbehaving nodes must be penalized ($I$), and data must be placed in a way that accounts for fault correlation ($D$).
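To make the shape of this definition more concrete, the sketch below expresses the five components as Python interfaces; these interfaces are purely illustrative and are not Codex code.

```python
from abc import ABC, abstractmethod


# Illustrative sketch of the DDE components Gamma = (R, A, P, I, D).
class Redundancy(ABC):  # R: e.g. replication or erasure coding
    @abstractmethod
    def encode(self, data: bytes) -> list[bytes]: ...

    @abstractmethod
    def decode(self, fragments: list[bytes]) -> bytes: ...


class Auditor(ABC):  # A: remote auditing of stored fragments
    @abstractmethod
    def audit(self, provider: str, fragment_id: str) -> bool: ...


class Repairer(ABC):  # P: restore redundancy once loss or corruption is detected
    @abstractmethod
    def repair(self, fragment_id: str) -> None: ...


class Incentives(ABC):  # I: reward good behavior, penalize misbehavior
    @abstractmethod
    def reward(self, provider: str) -> None: ...

    @abstractmethod
    def penalize(self, provider: str) -> None: ...


class Dispersal(ABC):  # D: decide which provider stores which fragment
    @abstractmethod
    def place(self, fragment_ids: list[str], providers: list[str]) -> dict[str, str]: ...


class DurabilityEngine:
    """A DDE instance bundles one concrete choice for each of the five components."""

    def __init__(self, r: Redundancy, a: Auditor, p: Repairer,
                 i: Incentives, d: Dispersal) -> None:
        self.r, self.a, self.p, self.i, self.d = r, a, p, i, d
```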
This is a somewhat informal treatment for now, but the actual parameters that would be input into any reliability analysis of a storage system would be contingent on those choices. In a future publication, we will explore how durability is affected by the choice of each of these elements in a formal framework.
## 4. Codex: A Decentralized Durability Engine
This section describes how Codex actually works. The primary motivation behind Codex is to provide a scalable and robust decentralized storage solution which addresses the limitations of existing DSNs. This includes: _i)_ enhanced durability guarantees that can be reasoned about; _ii)_ scalability and performance; and _iii)_ decentralization and censorship resistance.
We start this section by laying out key concepts required to understand how Codex works (Sec. 4.1). We then discuss the redundancy ($R$), remote auditing ($A$), and repair mechanisms ($P$) of Codex and how they combine erasure codes and zero-knowledge proofs into a system that is lightweight, efficient, and amenable to decentralization. Sec. 4.4 takes a detour onto the networking layer and provides an overview of our scalable data transfer protocols. Finally, incentives ($I$) and dispersal $(D)$ are discussed in Sec. 4.5 as part of the Codex marketplace.
### 4.1. Concepts
In the context of Codex (and of storage systems in general), two properties appear as fundamental:
**Availability.** A system is said to be _available_ when it is able to provide its intended service, and _unavailable_ otherwise. The availability of a system over any given interval of time is given by [^tanembaum]:
$$
p_\text{avail} =\frac{t_a}{t_a + t_u}
$$
where $t_a$ and $t_u$ are the total times in which the system remained available and unavailable, respectively. To maintain high availability, a storage system needs to be _fault tolerant_; i.e., it should be able to correctly service storage and retrieval requests in the presence of hardware faults and malicious participants.
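As a concrete illustration of the formula above, a system that is unavailable for a total of one hour over a year of operation has

$$
p_\text{avail} = \frac{8759}{8759 + 1} \approx 99.989\%,
$$

i.e., just under "four nines" of availability.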
**Durability.** Durability is quantified as the probability $p_\text{dur} = 1 - p_\text{loss}$ that a given unit of data _will not_ be lost over a given period of time; e.g., the probability that some file is not lost within a $1$-year period. This probability is sometimes expressed as a percentage (e.g., in S3).
If this number is very close to one, e.g. $p_\text{loss} \leq 10^{-6}$, then the system is said to be _highly durable_. Systems that are not highly durable are those that can lose data with higher or unbounded probability, or that do not quantify their loss probabilities at all.
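For instance, a yearly loss probability of $p_\text{loss} = 10^{-6}$ corresponds to

$$
p_\text{dur} = 1 - 10^{-6} = 99.9999\%,
$$

i.e., "six nines" of durability over a $1$-year period.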
Ideally, we would like storage systems to be highly available and highly durable. Since achieving _provable availability_ is in general not possible[^bassam_18], Codex focuses on stronger guarantees for durability and on incentivizing availability instead.
**Dataset.** A _dataset_ $D = \{c_1, \cdots, c_b\}$ is defined in Codex as an ordered set of $b \in \mathbb{N}$ fixed-size blocks. Blocks are typically small, on the order of $64\text{kB}$. For all intents and purposes, one can think of a dataset as being a regular file.
**Storage Client (SC).** A Storage Client is a node that participates in the Codex network to buy storage. These may be individuals seeking to backup the contents of their hard drives, or organizations seeking to store business data.
**Storage Provider (SP).** A Storage Provider is a node that participates in Codex by selling disk space to other nodes.
### 4.2. Overview
At a high level, storing data in Codex works as follows. Whenever a SC wishes to store a dataset $D$ into Codex, it:
1. splits $D$ into $k$ disjoint partitions $\{S_1, \cdots, S_k\}$ named **slots**, where each slot contains $s = \left\lceil \frac{b}{k} \right\rceil$ blocks (see the sketch following this list);
1. erasure-codes $D$ with a Reed-Solomon Code[^reed_60] by extending it into a new dataset $D_e$ which adds an extra $m \times s$ parity blocks to $D$ (Sec. 4.3). This effectively adds $m$ new slots to the dataset. Since we use a systematic code, $D$ remains a prefix of $D_e$;
1. computes two different Merkle tree roots: one used for inclusion proofs during data exchange, based on SHA256, and another used for storage proofs, based on Poseidon2 (Sec. 4.3);
1. generates a content-addressable manifest for $D_e$ and advertises it into the Codex DHT (Sec. 4.4);
1. posts a **storage request** in the Codex marketplace (on-chain) containing, for each slot, a set of parameters which include how much the SC is willing to pay for storage, as well as expectations that may impact the profitability of candidate SPs and the durability guarantees for $D_e$ (Sec. 4.5).
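The sketch below, given in Python purely for illustration, makes the arithmetic of steps 1 and 2 concrete by computing the resulting slot layout; it performs none of the actual erasure coding, Merkle commitments, or manifest construction, and the function and field names are ours rather than Codex identifiers.

```python
import math


def storage_layout(b: int, k: int, m: int) -> dict:
    """Slot layout for a dataset of b blocks split into k slots plus m parity slots.

    Purely illustrative; padding for non-divisible b is discussed in Sec. 4.3.
    """
    s = math.ceil(b / k)          # blocks per slot
    return {
        "blocks_per_slot": s,
        "data_slots": k,          # slots S_1..S_k holding the original dataset D
        "parity_slots": m,        # extra slots added by the systematic Reed-Solomon extension
        "parity_blocks": m * s,   # total parity blocks appended to form D_e
        "total_slots": k + m,     # slots that will eventually need to be filled by SPs
    }


# Example: a dataset of b = 1024 blocks with k = 8 data slots and m = 4 parity slots.
print(storage_layout(b=1024, k=8, m=4))
# {'blocks_per_slot': 128, 'data_slots': 8, 'parity_slots': 4,
#  'parity_blocks': 512, 'total_slots': 12}
```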
The Codex marketplace (Sec. 4.5) then ensures that SPs willing to store data for a given storage request are provided a fair opportunity to do so. Eventually, for each slot $S_i \in D_e$, _some_ SP will:
1. declare its interest in it by filing a **slot reservation**;
1. download the data for the slot from the SC;
1. provide an initial proof of storage and some staking collateral for it.
Once this process completes, we say that slot $S_i$ has been **filled**. Once all slots in $D_e$ have been filled, we say that $D_e$ has been **fulfilled**. For the remainder of this section, we will dive into the architecture and mechanisms of Codex by explaining in more detail each aspect of the storage flow.
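Purely as an illustration of the flow above, the slot life-cycle it implies can be summarized as follows; the state names are ours and not Codex identifiers.

```python
from enum import Enum, auto


class SlotState(Enum):
    """Slot life-cycle implied by the storage flow; state names are illustrative."""
    EMPTY = auto()     # no SP has claimed the slot yet
    RESERVED = auto()  # an SP has filed a slot reservation and is downloading the data
    FILLED = auto()    # the SP has posted an initial proof of storage and collateral


def dataset_fulfilled(slots: list[SlotState]) -> bool:
    """A dataset D_e is fulfilled once every one of its slots has been filled."""
    return all(state is SlotState.FILLED for state in slots)
```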
### 4.3. Erasure Coding, Repair, and Storage Proofs
Erasure coding plays two main roles in Codex: _i)_ allowing data to be recovered following the loss of one or more SPs and the slots that they hold (redundancy), and _ii)_ enabling cost-effective proofs of storage. We will go through each of these aspects separately.
**Erasure Coding for Redundancy.** As described before, a dataset $D$ is initially split into $k$ slots of size $s = \left\lceil \frac{b}{k} \right\rceil$ (Figure 1). Since $b$ may not actually be divisible by $k$, Codex will add _padding blocks_ as required so that the number of blocks in $D$ is $b_p = s \times k$.
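As a small worked example with illustrative numbers: for $b = 10$ blocks and $k = 4$ slots,

$$
s = \left\lceil \frac{10}{4} \right\rceil = 3, \qquad b_p = s \times k = 12,
$$

so two padding blocks are appended before the slots are formed.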