We introduce an extensible and versioned container format for the EVM with a once-off validation at deploy time. The version described here brings the tangible benefit of code and data separation, and allows for easy introduction of a variety of changes in the future. This change relies on the reserved byte introduced by [EIP-3541](./eip-3541.md).
On-chain deployed EVM bytecode contains no pre-defined structure today. Code is typically validated in clients to the extent of `JUMPDEST` analysis at runtime, every single time prior to execution. This poses not only an overhead, but also a challenge for introducing new or deprecating existing features.
Validating code during the contract creation process allows code versioning without an additional version field in the account. Versioning is a useful tool for introducing or deprecating features, especially for larger changes (such as significant changes to control flow, or features like account abstraction).
The format described in this EIP introduces a simple and extensible container with a minimal set of changes required to both clients and languages, and introduces validation.
The first tangible feature it provides is separation of code and data. This separation is especially beneficial for on-chain code validators (like those utilised by layer-2 scaling tools, such as Optimism), because they can distinguish code and data (this includes deployment code and constructor arguments too). Currently they a) require changes prior to contract deployment; b) implement a fragile method; or c) implement an expensive and restrictive jump analysis. Code and data separation can result in ease of use and significant gas savings for such use cases. Additionally, various (static) analysis tools can also benefit, though off-chain tools can already deal with existing code, so the impact is smaller.
- Introducing static jumps (with relative addresses) and jump tables, and disallowing dynamic jumps at the same time.
- Requiring code section(s) to be terminated by `STOP`. (Assumptions like this can provide significant speed improvements in interpreters, such as a speed up of ~7% seen in [evmone](https://github.com/ethereum/evmone/pull/295).)
- Multi-byte opcodes without any workarounds.
- Representing functions as individual code sections instead of subroutines.
In order to guarantee that every EOF-formatted contract in the state is valid, we need to prevent already deployed (and not validated) contracts from being recognized as such format. This is achieved by choosing a byte sequence for the *magic* that doesn't exist in any of the already deployed contracts.
*For purely reference purposes we call the `0xEF` byte the `FORMAT`.*
The *initcode* is the code executed in the context of the *create* transaction, `CREATE`, or `CREATE2` instructions. The *initcode* returns *code* (via the `RETURN` instruction), which is inserted into the account. See section 7 ("Contract Creation") in the Yellow Paper for more information.
The opcode `0xEF` is currently an undefined instruction, therefore: *It pops no stack items and pushes no stack items, and it causes an exceptional abort when executed.* This means *initcode* or already deployed *code* starting with this instruction will continue to abort execution.
In this fork we introduce _code validation_ for new contract creation. To achieve this, we define a format called EVM Object Format (EOF), containing a version indicator, and a ruleset of validity tied to a given version.
At `block.number == HF_BLOCK` new contract creation is modified:
- if *initcode* or *code* starts with the *EOF prefix*, it is considered to be EOF formatted and will undergo validation specified in the following sections,
EVM and/or account versioning has been discussed numerous times over the past years. This proposal aims to learn from them. See [this collection of previous proposals](https://ethereum-magicians.org/t/ethereum-account-versioning/3508) for a good starting point.
- All created contracts with *EOFn* prefix are valid according to version *n* rules. This is very strong and useful property. The client can trust that the deployed code is well-formed.
- In future, this allows to serialize `JUMPDEST` map in the EOF container and eliminate the need of implicit `JUMPDEST` analysis required before execution.
- Or to completely remove the need for `JUMPDEST` instructions.
- The biggest disadvantage is that deploy-time validation of EOF code must be enabled in two hard-forks. However, the first step ([EIP-3541](./eip-3541.md)) is already deployed in London.
The alternative is to have execution time validation for EOF. This is performed every single time a contract is executed, however clients may be able to cache validation results. This _alternative_ approach has the following properties:
- Because the validation is consensus-level execution step, it means the execution always requires the entire code. This makes _code merkleization impractical_.
- Can be enabled via a single hard-fork.
- Better backwards compatibility: data contracts starting with the `0xEF` byte or the *EOF prefix* can be deployed. This is a dubious benefit however.
The [Changes to contact creation semantics](#changes-to-contract-creation-semantics) section defines minimal set of restrictions related to the contract creation: if _initcode_ or _code_ has the EOF1 container prefix it must be validated. This adds two validation steps in the contract creation, any of it failing will result in contract creation failure.
1. The EOF version of _initcode_ must much the version of _code_.
2. An EOF1 contract must not create legacy contracts.
Finally, create transaction must be allowed to contain legacy _initcode_ and deploy legacy _code_ because otherwise there is no transition period allowing upgrading transaction signing tools. Deprecating such transactions may be considered in future.
- Streaming headers (i.e. `section_header, section_data, section_header, section_data, ...`) are used in some other formats (such as WebAssembly). They are handy for formats which are subject to editing (adding/removing sections). That is not a useful feature for EVM. One minor benefit applicable to our case is that they do not require a specific "header terminator". On the other hand they seem to play worse with code chunking / merkleization, as it is better to have all section headers in a single chunk.
- Whether to have a header terminator or to encode `number_of_sections` or `total_size_of_headers`. Both raise the question how large of a value these fields should be able to hold. While today there will be only two sections, in case each "EVM function" would become a separate code section, a fixed 8-bit field may not be big enough. A terminator byte seems to avoid these problems.
- Whether to encode `section_size` as a fixed 16-bit value or some kind of variable length field (e.g. [LEB128](https://en.wikipedia.org/wiki/LEB128)). We have opted for fixed size, because it simplifies client implementations, and 16-bit seems enough, because of the currently exposed code size limit of 24576 bytes (see [EIP-170](./eip-170.md) and [EIP-2677](./eip-2677.md)). Should this be limiting in the future, a new EOF version could change the format. Besides simplifying client implementations, not using LEB128 also greatly simplifies on-chain parsing.
### PC starts with 0 at the code section
The values for `PC` and `JUMP`/`JUMPI` start with 0 and are within the *code* section. We considered keeping `PC`/`JUMP`/`JUMPI` values to operate on the whole *container* and be consistent with `CODECOPY`/`EXTCODECOPY` but in the end decided otherwise. It looks to be much easier to propose EOF extensions that affect jumps and jumpdests when `JUMP`/`JUMPI` already operates on indexes within *code* section only. This also feels more natural and easier to implement in EVM: the new EOF EVM should only care about traversing *code* and accessing other parts of the *container* only on special occasions (e.g. in `CODECOPY` instruction).
This is a breaking change given that any code starting with `0xEF` was not deployable before (and resulted in exceptional abort if executed), but now some subset of such codes can be deployed and executed successfully.
Given the rigid rules of EOF1 it is possible to implement support for the container in clients using very simple pattern matching:
```python
FORMAT = 0xEF
MAGIC = 0x00 # To be defined
VERSION = 0x01
S_TERMINATOR = 0x00
S_CODE = 0x01
S_DATA = 0x02
def validate_eof(code: bytes):
total_size = 0
if len(code) > 7 and code[0] == FORMAT and code[1] == MAGIC and code[2] == VERSION and code[3] == S_CODE and code[6] == S_TERMINATOR:
total_size = 7 + ((code[4] <<8)|code[5])
elif len(code) > 10 and code[0] == FORMAT and code[1] == MAGIC and code[2] == VERSION and code[3] == S_CODE and code[6] == S_DATA and code[9] == S_TERMINATOR:
However, future versions may introduce more sections or loosen up restrictions, requiring clients to actually parse sections instead of pattern matching.
Proposed validation rules can be checked at constant time, therefore it should not be easily attackable. This is subject to change with future extensions.
Currently *initcode* validation has no extra cost and the currently charged creation costs should be sufficient, however we consider adding an additional gas cost for contract creation.