feat: contract deployment and upgrades writeup (#195)

* feat: contract deployment and upgrades writeup * feat: extending the refactor section and polishing
2025-02-03 14:25:48 +00:00 · 2025-01-10 07:34:34 +01:00 · 2025-01-10 07:34:34 +01:00 · f72298364e
commit f72298364e
parent ea403ded23
1 changed files with 338 additions and 0 deletions
--- a/design/contract-deployment.md
+++ b/design/contract-deployment.md
@ -0,0 +1,338 @@
 # Codex Contract Deployment, Upgrades and Security
 We created this document to document our thoughts on contract deployment. We have explored several
 ways how to tackle it, but as it is not only an engineering problem but also involves governance and legal considerations
 it took several iterations before we reached our final design.
 Our main design goal is to have our system as permissionless as possible, yet we need to understand how to do it in the
 most secure way and have prepared scenarios for the worst cases, like discovered unexploited bugs or damage control on
 exploited bugs. In this problem space, there are no simple answers, but it requires weighing all sorts of trade-offs.
 ## Original naive deployment
 Our original vision was to have the most permissionless and simple deployment. We wanted to simply deploy the contracts in
 which there would not be any "admin roles" that would have the power to affect the whole contract and network. Moreover,
 there would not be any "proxy contract" or "upgradable contract," just a simple deployment of our contract suite. The
 smart contract's address would then be hardcoded into the Codex client. Hence, we would not possess any "ownership" over
 the smart contract or have any decision power that would affect the whole network from the smart contract side.
 ### Feature upgrades
 Of course, we expect and have already planned future upgrades of Codex and its smart contracts; therefore, we thought of
 upgrade paths. As mentioned above, we would not have any ownership over the smart contract, so we would propose new
 upgrades through new Codex client releases. New smart contract addresses will be changed in the Codex client after
 deployment and will be part of the next Codex client release.
 We envision a graceful handover period during which the Codex client will be supporting two smart contracts—the
 original deprecated one and the new upgraded one. In this period, the Codex client would create all the new Storage
 Requests on the new smart contract, but it will still fulfill the duties of the Storage Requests of the deprecated
 contract.
 ```
                                ┌─────────────────────────────────────────      
                                │                Contract v2                    
 ┌───────────────────────────────┼────────────────────┬──────────────────────►   
 │           Contract v1         │┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼│                      time
 └───────────────────────────────┼────────────────────┤                          
                                │    Grace period    │                          
                                │                    │                          
                                │                    │                          
                                │                    │                          
                                │                    │                          
                                │                    │                          
                       Codex release with             Codex release that        
                       new smart contract             removes old contract      
 ```
 An important consideration for this upgrade mechanism is that we plan to limit the maximum duration of Storage Requests.
 This ensures that when we release a new version of the Codex client with the upgraded smart contract, we have a
 deterministic date upon which we can sunset and remove the original smart contract from the Codex client, as all Storage
 Requests will be completed. Initially, we plan to limit the duration to one month, and as the network matures and
 upgrades become less frequent, we intend to extend this period.
 > [!WARNING]
 > ❓**Upgrade period's end**❓
 >
 > Even when we have the limit on Storage Request's duration, when should we really remove it from the client? We can't
 > expect everybody to update the client right on the new release day. For example, we should wait double the amount of the
 > duration limit time. Or we could make more advanced decisions based on statistics from the network? Also, the
 > fund-withdraw smart contract calls (eq. for expired Requests) can be delayed by the user even indefinitely. But then,
 > there will be always the old versions of Codex around, so if somebody will remember that he did not withdraw funds for
 > too long, he can always download the old version and spin it up only for the withdrawal calls and then delete it.
 There is one open question, though. If some participants in the network do not follow the latest updates of the Codex
 client, they might end up with a version that still uses the deprecated smart contracts, even though the rest of the
 network might have already moved. We considered some sort of freezing functionality with which we could stop a contract
 once its lifespan is finished. However, we are not so fond of the idea because it would require an admin-role user that
 would have the power to affect the whole network. We argue that it is the responsibility of the node operators to
 update their nodes in a timely manner, and if they wish to fall behind, they are free to stay at their desired version. A potential
 mitigation of this problem could be to set a smart contract expiry, after which it would go into "withdraw only" mode.
 This would not need any admin-role but would require us not to miss the deadline and release a new version every time
 the lifetime of the contract comes to an end.
 ## Security consideration
 Even though we invest significant time and resources to make the smart contracts as safe as possible, there is still a
 very possible risk of exploitable bugs being found after deployment. We can think of two main situations that can occur:
 1. Exploit is discovered and privately disclosed to us.
 2. Exploit is unknown to us and is being/was actively exploited.
 In the first situation, we would be given leeway to correct our bug, but the clock would be ticking as we might be in a
 race with potential exploiters who could be trying to exploit the bug eventually. In this case, we would have to perform
 some sort of emergency upgrade. For such an upgrade, there would need to be no transition period, as the old smart
 contract containing the bug would need to be sunset ASAP. This is in contradiction to our original upgrade path, which
 is timed according to the Storage Request's duration limit.
 In the second situation, we would need to perform the same emergency upgrade, but by the time we
 finish the emergency upgrade, funds might already be stolen. Hence, we would need some mechanism to pause the contract
 in a way that no malicious transfer of funds could occur during this period.
 It is also worth noting that these situations impose a big reputation problem for our project. Some could argue that
 "if we get hacked, we are done". While this could be true to some extent, we should not give up on mitigating these
 situations. In the first situation presented above, we still have a chance to recover from the situation. In the second
 situation, it is our responsibility to do damage control to protect the funds and data of our users, even
 though some of it might be lost in the process.
 ## Mitigations
 We explored several ways how we can mitigate such risks. Unfortunately, all of them require us more or less to relax our
 initial ideals of a permissionless approach.
 ### Upgradable contract
 OpenZeppelin has tooling for writing [Upgradable contracts](https://docs.openzeppelin.com/upgrades-plugins/1.x/). While
 this approach has some technical limitations, it should be sufficient for realizing bug fixes as part of emergency upgrades. 
 This would require an admin role that would have this capability.
 While this solution provides the most capable and flexible way to ensure security, it also gives "too much power" for
 the admin role. With these upgrades, one could change the contract to, for example, withdraw all funds to a certain address
 without the consent of the network. Therefore, it creates liabilities, centralizes control over our decentralized network, and
 has potential legal implications. Therefore, we are reluctant to take this path.
 Another part that is often found in the ecosystem coupled with upgradable contracts
 is [timelocks](https://blog.openzeppelin.com/protect-your-users-with-smart-contract-timelocks). They allow the
 users to take action if they disagree with the proposed upgrade before it is applied. This, though, contradicts the
 need for timely action as we are describing emergency upgrades here and not "feature upgrades".
 ### Freezing contract
 Another approach would be to have an admin role that would be allowed only to freeze the contract. The frozen contract
 would transition to a "withdraw-only" mode, where all the Storage Requests that were running in the contract would get
 terminated up to the time when the contract was frozen, and the funds would be split between the Client and Storage
 Provider according to the service rendered.
 > [!TIP]
 > OpenZeppelin has support for writing [Pausable contracts](https://docs.openzeppelin.com/contracts/2.x/api/lifecycle#pausable).
 The big problem with this approach is that it effectively purges the network. Upon the release of the new version, the
 new contract won't have any of the old Storage Requests. As a result, SPs won't have an incentive to keep the data they
 originally stored. Clients could recreate the old Storage Requests, but they might fail to do so for two
 reasons. First, Clients might not be available in time to do so as they are the most ephemeral participants in the
 network. Second, they might not be in possession of the original dataset at the time of Storage Requests recreation, and
 by that time, the data might already be removed from the original SPs.
 Wiping the network is a big problem for a project that is so focused on durability like ours. Therefore, we should find
 a solution to prevent this. The good news is that generally, participants are motivated to keep the network working in
 the case of these "catastrophic events." It could be expected from them to provide some leeway in the form of downtime,
 unpaid hosting time, sacrificing some payout, etc., in order for the network to recover.
 > [!WARNING]
 > ❓**Unfreezing**❓
 >
 > Should it be possible to also have the capability to "unfreeze" the smart contract? This would require us to rethink
 > the "withdraw-only" mode, but could be beneficial in some circumstances.
 > [!IMPORTANT]
 > The account that has this role assigned should most probably be handled with multi-sig.
 >
 > There are questions about how this multi-sig should be set up: the number of participants, the requirements for quorum
 > etc.
 >
 > In the future, trusted 3rd-parties from the community might also be included.
 #### Warchest
 We have devised an idea to prevent the network from wiping. As it is our responsibility to deploy secure
 contracts, we should also be ready to "pay the price" when our work has an exploit in it. We, therefore, envision drafting a war
 chest of Codex tokens for this occasion. Its funds would then be used to fund the recreation of
 the Storage Requests on the newly deployed smart contracts, which would incentivize original SPs not to ditch the data
 they were hosting before freezing the old smart contract.
 The process would then go along these lines:
 1. The old contract is frozen - everybody can only withdraw their funds (SPs: collateral + partial payout; Clients: rest
   of the payout).
 2. We deploy a new contract with the fix.
 3. We reconstruct the state of the old contract's Storage Requests prior to the exploit being used from the blockchain
   logs.
 4. We recreate the Storage Requests from the rebuilt state using the funds from the war chest. Requests are reserved for
   some time for the original SPs that hosted them.
 5. We release a new version with a fixed smart contract.
 6. SPs update to the new version, and upon starting, the node checks for reservations. If found, they deposit collateral
   back to the Storage Request and continue their original hosting.
 7. (?) Clients can potentially top up the Storage Request to its original price.
 ### Locked funds limit
 Another possibility would be to limit the amount of funds locked inside the contract in order not to attract the
 attention of black hats. We already plan to have the limitation of Storage Request's duration, which already contributes
 in this direction. Although other limitations are a bit questionable. While we want to mitigate possible
 attacks, we do not want to hinder the usability and usefulness of the network by enforcing some artificial limitations
 upon the participants. On the other hand, this could be a potentially useful strategy for initial deployment with clear
 messaging that we will slowly be removing these artificial constraints as the network and smart contract suite matures.
 ## Final design - Contract modularization
 The final design combines the "Freezing" and "Upgradable" mitigation approaches described above.
 This proposal's core is refactoring our current monolithic smart contract structure and splitting it into two new
 contracts — a business logic contract and a funds vault contract.
 We will reference the business logic contract as **Marketplace contract** as it will contain the Marketplace logic.
 It will be possible to perform emergency upgrades using concepts described in the [Upgradable contract](#upgradable-contract) section.
 In this way, if there is a bug/exploit that would affect the funds, it is possible to patch it quickly.
 The original "feature upgrade" path still holds with this approach, where this business logic contract would get upgraded 
 as discussed in the [Feature upgrades](#feature-upgrades) section.
 The admin role would belong to multisig and be maintained inside the organization. This contract would not hold any funds as
 it would delegate them to the Vault contract.
 Vault contract will have the responsibility to keep users' funds safe. It should have minimal logic incorporated to minimize
 the attack surface. Ideally, the Vault contract will be deployed once and never altered as it will have one simple task 
 to do, and there should not be a need to extend it. This contract will not incorporate the upgradability capabilities,
 but as a safety measure, it will be possible to freeze it as described in the [Freezing contract](#freezing-contract) section. 
 Freezing the Vault would be done in case of severe exploitable bugs.
 Thanks to this splitting of the contract, we will limit the liabilities over funds control as described 
 in the [Upgradable contract](#upgradable-contract) section, but at the same it gives us the flexibility to react to 
 unforeseen situations.
 The Vault contract should have logic that prevents the simultaneous draining of all the funds. We came up with two designs for
 this - time-based locking and recipient-based locking. The time-based locking Vault is described in depth below.
 The recipient-base Vault works with a locking schema where the funds have a predefined set of recipients to which the funds
 can be transferred. In this way, the hacker can't redirect the funds to their controlled accounts.
 Unfortunately, this concept is not applicable to Marketplace because of slot repairs, when one slot's host is replaced
 with another, which would require reallocating funds and hence open an opportunity for hackers to redirect the funds to
 their accounts.
 ### Time-based Vault
 This Vault works on locking the funds until a certain time threshold is reached when it allows them to be spent. In this way
 there is only a tiny fraction of the funds possible to be spent at a given time by the "business logic" contract as
 we assume nodes will proactively and quickly collect their funds when able. If there is an exploit on the business logic
 contract, the attacker could withdraw only a small amount of funds.
 We envision the following API of the Vault contract:
 ```solidity
 contract TimeVault {
   /// Creates new deposit with the given amount transferred from account "fromAccount" and locks it till spendable_from_timestamp
   function deposit(uint256 amount, addr fromAccount, uint256 spendable_from_timestamp) returns (DepositId)
   /// Deposits more funds to the already existing deposit 
   function deposit(uint256 amount, addr fromAccount, uint256 spendable_from_timestamp, DepositId id) returns (DepositId)
   /// Extends the timelock of the specified deposit 
   function extend(DepositId id, uint256 spendable_from_timestamp)
   /// Lower deposit amount of funds from the specified deposit
   function burn(DepositId id, uint256 amount)
   /// Transfer the given amount to the recipient, provided the block's timestamp is after the deposit's time lock 
   function spend(DepositId id, addr recipient, uint256 amount)
 }
 ```
 _Important note is that this is a standalone contract, and it needs to keep track of the "owner of the deposit".
 Hence, only the address that performs initial deposit can manipulate the funds later on._
 Integration into the Marketplace contract would be in the following way:
 - `deposit()` reward funds upon `requestStorage()` call with timelock until the Request's `expiry`
   - If the Request starts, the timelock is extended till the Request's end
   - If the Request expires, then funds (partial payouts for hosts and remaining funds for a client) can be withdrawn when requested
 - `deposit(depositId)` collateral funds to existing Request's deposit upon `fillSlot()`
   - If the Request expires, the collateral can be withdrawn together with partial payouts
 - Upon slot's being freed because of the host being kicked out, then `burn()` host's collateral lowered by the amount dedicated repair reward
 - Upon Request's end, collateral and rewards can be `spend()`
 - Upon Request's failure, all host's collateral is `burn()`. The original reward can be `spend()` back to the client, but only after the Request's end
 The main disadvantage of this approach is that when the Request fails, the client can collect its original funds only after the Request's end.
 ### Contract's architecture
 With the contract's split, the dependencies between them will be simple. The Marketplace contract will
 get the address of the Vault contract upon deployment, which it will utilize for all the funds keeping.
 Codex node will not have to be bothered about Vault's existence as every fund's interaction will be proxied
 through the Marketplace contract. The Codex node will have a hardcoded Marketplace contract address with which it will interact.
 ```
 ┌────────────┐     ┌──────────────────────┐     ┌────────────────┐      
 │            │     │                      │     │                │      
 │ Codex Node ├────►│ Marketplace contract ├────►│ Vault contract │      
 │            │     │                      │     │                │      
 └────────────┘     └──────────────────────┘     └────────────────┘      
 ```
 ### Scenarios handling
 With this design, three main scenarios can happen. For completeness and clarity, we will briefly describe how they
 will be handled, even though they have already been described in previous chapters.
 #### Normal feature upgrade
 In a routine feature upgrade, a new version of the Marketplace contract will be deployed. It will utilize the original Vault contract
 , which address is specified upon contract deployment. After that, the deployed contract's address is updated in the
 Codex's auto contract discovery mechanism (eq., hard-coded based on the deployed chain). The old version of the Marketplace contract
 will be deprecated in the Codex node implementation, which will lead to  creating all new storage requests in the new Marketplace contract
 . The old Marketplace contract will be kept around only to fulfill still-running requests and allow the withdrawal of funds.
 For more details about feature upgrades, see the [Feature upgrades](#feature-upgrades) section.
 > [!IMPORTANT]
 > Because of the tracking of the deposit owners in the Vault contract, the funds are safely separated, so multiple versions of
 > Marketplace contract can utilize the same Vault contract.
 #### Emergency logic upgrade
 If an exploitable bug is found in the Marketplace contract, we will utilize the upgradability capability to patch
 this bug. This patch will be applied using a multisig account.
 This capability will allow us to fix major problems in the network without wiping it as described
 in the [Freezing contract](#freezing-contract) section.
 > [!WARNING]
 > The multisig account will have an admin role in the network with capabilities to affect the whole network.
 > Because of this, the participants in the multisig should be trusted, and the quorum should be balanced in order
 > to prevent a hostile takeover while allowing timely response when needed.
 > [!CAUTION]
 > The multisig account will not have direct control over the funds as they will be in the safekeeping of the Vault.
 > But with the current Vault design, only the depositor can manipulate the funds, which in our case
 > is the Marketplace contract that the multisig has control over. Maybe we should consider how users could
 > withdraw funds from the Vault directly without interacting through the Marketplace contract. That will have to be
 > thoroughly considered, as allowing such a thing could hinder the security guarantees of the Vault.
 #### Vault emergency
 The most impactful bug that could be discovered would be in the Vault. If such a bug were discovered and exploited,
 we would have the only possibility to permanently freez the Vault. Such an action would again originate from a multisig
 account. This action would lead to a hard fork of the whole network, where a new token contract would
 probably have to be deployed, and the whole community would have to coordinate around migrating to the new token contract. Most probably, a snapshot of the token distribution before the hack would be taken, and that would be used to instantiate the new
 token. This situation would definitely be disruptive for the whole network, and in the future, we should look more into how this would
 play out in order not to lose the data of users of the network when this happens.
 > [!NOTE]
 > It is a question if the multisig account used for freezing Vault should be the same multisig as the one for the
 > Marketplace's emergency upgrades.
 > [!WARNING]
 > It is a question if _unfreezing_ should be supported. For example, It could help handle the Emergency Marketplace
 > upgrades, where any funds movement could be halted until a bugfix could be applied. But then in the case of a Vault emergency
 > and freezing the Vault because of problem directly with it, it would be desired to ensure that it will never be unfrozen again.
 > It could be possible to implement this by giving up the admin role, where the "allowed account" to make these changes
 > would be set to zero address after the freezing call.
 ## Resources
 - [Ethereum:Smart contract security](https://ethereum.org/pcm/developers/docs/smart-contracts/security/#implement-disaster-recovery-plans)
 - [OpenZeppelin: Strategies for Safer Governance systems](https://blog.openzeppelin.com/smart-contract-security-guidelines-4-strategies-for-safer-governance-systems)
 - [OpenZeppelin: Timelocks](https://blog.openzeppelin.com/protect-your-users-with-smart-contract-timelocks)