Merge pull request #16 from multiformats/spec/update

The unified multicodecs theory
David Dias 2016-11-16 11:38:31 +00:00 committed by GitHub
commit d6e0ec1b94
3 changed files with 134 additions and 223 deletions

146
README.md

@ -4,15 +4,14 @@
[![](https://img.shields.io/badge/project-multiformats-blue.svg?style=flat-square)](http://github.com/multiformats/multiformats)
[![](https://img.shields.io/badge/freenode-%23ipfs-blue.svg?style=flat-square)](http://webchat.freenode.net/?channels=%23ipfs)
> self-describing codecs
> compact self-describing codecs. Save space by using predefined multicodec tables.
## Table of Contents
- [Motivation](#motivation)
- [How does it work? - Protocol Description](#how-does-it-work---protocol-description)
- [Prefix examples](#prefix-examples)
- [prefix - codec - desc](#prefix---codec---desc)
- [The protocol path](#the-protocol-path)
- [Multicodec tables](#multicodec-tables)
- [Standard multicodec table](#standard-mcp-protocol-table)
- [Implementations](#implementations)
- [FAQ](#faq)
- [Maintainers](#maintainers)
@ -21,136 +20,83 @@
## Motivation
Multicodecs are self-describing protocol/encoding streams. (Note that a file is a stream). It's designed to address the perennial problem:
[Multistreams](https://github.com/multiformats/multistream) are self-describing protocol/encoding streams. Multicodec uses an agreed-upon "protocol table". It is designed for use in short strings, such as keys or identifiers (e.g. [CID](https://github.com/ipld/cid)).
> I have a bitstring, what codec is the data coded with!?
## Protocol Description - How does the protocol work?
Instead of arguing about which data serialization library is the best, let's just pick the simplest one now, and build _upgradability_ into the system. Choices are never _forever_. Eventually all systems are changed. So, embrace this fact of reality, and build change into your system now.
`multicodec` is a _self-describing multiformat_: it wraps other formats with a tiny bit of self-description. A multicodec identifier is both a varint and the code identifying the data that follows. This means that the most significant bit of every byte of a multicodec code is reserved to signal continuation.
Multicodec frees you from the tyranny of past mistakes. Instead of trying to figure it all out beforehand, or continue using something that we can all agree no longer fits, why not allow the system to _evolve_ and _grow_ with the use cases of today, not yesterday.
To decode an incoming stream of data, a program must either (a) know the format of the data a priori, or (b) learn the format from the data itself. (a) precludes running protocols that may provide one of many kinds of formats without prior agreement on which. multistream makes (b) neat using self-description.
Moreover, this self-description allows straightforward layering of protocols without having to implement support in the parent (or encapsulating) one.
## How does it work? - Protocol Description
`multicodec` is a _self-describing multiformat_, it wraps other formats with a tiny bit of self-description:
This way, a chunk of data identified by multicodec will look like this:
```sh
<multicodec-header><encoded-data>
# or
<varint-len><code>\n<encoded-data>
<multicodec-varint><encoded-data>
# To reduce the cognitive load, we sometimes might write the same line as:
<mcp><data>
```
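A minimal sketch of the `<multicodec-varint><encoded-data>` layout, assuming the unsigned varint encoding (7 bits per byte, least-significant group first, high bit set on every byte except the last). The names `encodeVarint` and `addPrefix` are hypothetical and not taken from any multicodec implementation; the code `0x71` (dag-cbor) comes from `table.csv` in this change.

```javascript
// Encode a non-negative integer as an unsigned varint:
// 7 bits per byte, least-significant group first,
// high bit set on every byte except the last.
function encodeVarint(n) {
  if (!Number.isSafeInteger(n) || n < 0) {
    throw new RangeError('expected a non-negative safe integer');
  }
  const bytes = [];
  while (n >= 0x80) {
    bytes.push((n % 0x80) | 0x80);
    n = Math.floor(n / 0x80);
  }
  bytes.push(n);
  return Uint8Array.from(bytes);
}

// Hypothetical helper: prepend the varint-encoded code to the data.
function addPrefix(code, data) {
  const prefix = encodeVarint(code);
  const out = new Uint8Array(prefix.length + data.length);
  out.set(prefix, 0);
  out.set(data, prefix.length);
  return out;
}

// 0xa0 is an empty CBOR map; 0x71 is dag-cbor in table.csv.
const packed = addPrefix(0x71, Uint8Array.of(0xa0));
```

Codes below `0x80` fit in a single byte equal to the code itself, so the prefix costs one byte for most entries in the table.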
For example, let's encode a json doc:
```node
> // encode some json
> var str = JSON.stringify({"hello":"world"})
> var buf = multicodec.encode('json', str) // prepends multistream.header('/json')
> buf
<Buffer 06 2f 6a 73 6f 6e 2f 7b 22 68 65 6c 6c 6f 22 3a 22 77 6f 72 6c 64 22 7d>
> buf.toString('hex')
062f6a736f6e2f7b2268656c6c6f223a22776f726c64227d
> // decode, and find out what is in buf
> multicodec.decode(buf)
{ "codec": "json", "data": '{"hello":"world"}' }
```
So, `buf` is:
Another useful scenario is using multicodec-packed as part of the keys used to access data. For example:
```
hex: 062f6a736f6e2f7b2268656c6c6f223a22776f726c64227d
ascii: \x06/json/{"hello":"world"}
# suppose we have a value and a key to retrieve it
"<key>" -> <value>
# we can use multicodec-packed with the key to know what codec the value is in
"<mcp><key>" -> <value>
```
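The key-prefixing idea above can be sketched as follows. This is an illustration only: `prefixKey` and `codecOfKey` are hypothetical names, and the sketch assumes the code is below `0x80`, so that its varint is a single byte holding the code itself.

```javascript
// Assumption for brevity: the code is below 0x80, so its varint is the
// single byte holding the code itself. Larger codes need a full varint codec.
function prefixKey(code, key) {
  if (!Number.isInteger(code) || code < 0 || code >= 0x80) {
    throw new RangeError('this sketch only handles single-byte codes');
  }
  return String.fromCharCode(code) + key;
}

// Read the codec code back from the first character of a prefixed key.
function codecOfKey(prefixedKey) {
  const code = prefixedKey.charCodeAt(0);
  if (!(code < 0x80)) {
    throw new RangeError('this sketch only handles single-byte codes');
  }
  return code;
}

// "<mcp><key>" -> <value>
const store = new Map();
// 0x71 is dag-cbor in table.csv; 0xa0 is an empty CBOR map.
store.set(prefixKey(0x71, 'my-key'), Uint8Array.of(0xa0));

// A reader can learn the value's codec from the key alone.
const [storedKey] = store.keys();
const valueCodec = codecOfKey(storedKey);
```

The reader never has to inspect the value to decide which decoder to use; the key carries that information.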
The more you know! Let's try it again, this time with protobuf:
It is worth noting that multicodec-packed works very well in conjunction with [multihash](https://github.com/multiformats/multihash) and [multiaddr](https://github.com/multiformats/multiaddr), as you can prefix those values with a multicodec-packed to tell what they are.
```
cat proto.c
```
## MulticodecProtocol Tables
See also: [multicodec-packed](./multicodec-packed.md).
Multicodec uses "protocol tables" to agree upon the mapping from a multicodec code (a single varint) to the codec it identifies. These tables can be application specific, though -- like [with](https://github.com/multiformats/multihash) [other](https://github.com/multiformats/multibase) [multiformats](https://github.com/multiformats/multiaddr) -- we will keep a globally agreed upon table with common protocols and formats.
## Prefix examples
## Multicodec table
The full table can be found at [table.csv](/table.csv) inside this repo.
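As a rough illustration of how an application might load the table, the sketch below parses rows of the form `codec, description, code` into a lookup map. `parseTable` is a hypothetical helper, not part of any multicodec library, and it assumes descriptions contain no commas. Category rows, rows whose code is still unassigned (`0x`), and ranges are skipped.

```javascript
// Parse rows of the form "name, description, code" into a Map of name -> code.
// Category rows (e.g. "multiformats,,"), rows with an unassigned code ("0x"),
// and ranges such as "0x4000-0x40f0" are skipped.
function parseTable(csv) {
  const table = new Map();
  const lines = csv.split('\n').slice(1); // drop the header row
  for (const line of lines) {
    const [name, , code] = line.split(',').map((s) => s.trim());
    if (!name || !/^0x[0-9a-fA-F]+$/.test(code || '')) continue;
    table.set(name, parseInt(code, 16));
  }
  return table;
}

// A few rows copied from table.csv.
const sample = [
  'codec, description, code',
  'multiformats,,',
  'multihash, , 0x31',
  'base16, hexadecimal, 0x',
  'dag-cbor, MerkleDAG cbor, 0x71',
].join('\n');

const table = parseTable(sample);
```

Here `table.get('dag-cbor')` yields `0x71`, while `base16` is absent because its code has not been assigned yet.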
| prefix | codec | desc | type | [packed encoding](https://github.com/multiformats/multicodec/blob/master/multicodec-packed.md)|
|----------------|-------|-------------|-------|---------------------------------------|
|0x052f62696e2f | /bin/ |raw binary |binary | 0x00 |
|0x042f62322f | /b2/ |ascii base2 |binary | |
|0x052f6231362f | /b16/ |ascii base16 |hex | |
|0x052f6233322f | /b32/ |ascii base32 | | |
|0x052f6235382f | /b58/ |ascii base58 | | |
|0x052f6236342f | /b64/ |ascii base64 | | |
|0x062f6a736f6e2f |/json/ | |json | |
|0x062f63626f722f |/cbor/ | |json | |
|0x062f62736f6e2f |/bson/ | |json | |
|0x072f626a736f6e2f|/bjson/| |json | |
|0x082f75626a736f6e2f| /ubjson/| |json | |
|0x182f6d756c7469636f6465632f | /multicodec/ | | multiformat | 0x40 |
|0x162f6d756c7469686173682f | /multihash/ | | multiformat | 0x41 |
|0x162f6d756c7469616464722f | /multiaddr/ | | multiformat | 0x42 |
|0x0a2f70726f746f6275662f |/protobuf/ | Protocol Buffers |protobuf| |
|0x072f6361706e702f | /capnp/ | Cap-n-Proto |protobuf| |
|0x092f666c61746275662f |/flatbuf/ | FlatBuffers |protobuf| |
|0x052f7461722f |/tar/ | | archive | |
|0x052f7a69702f |/zip/ | | archive | |
|0x052f706e672f | /png/ | | archive | |
|0x052f726c702f | /rlp/ | recursive length prefix | ethereum | 0x60 |
## The protocol path
### Adding new multicodecs to the table
`multicodec` allows us to specify different protocols in a universal namespace, that way being able to recognize, multiplex, and embed them easily. We use the notion of a `path` instead of an `id` because it is meant to be a Unix-friendly URI.
The process to add a new multicodec to the table is the following:
A good path name should be decipherable -- meaning that if some machine or developer -- who has no idea about your protocol -- encounters the path string, they should be able to look it up and resolve how to use it.
- 1. Fork this repo
- 2. Update the table with the value you want to add
- 3. Submit a Pull Request
An example of a good path name is:
```
/bittorrent.org/1.0
```
An example of a _great_ path name is:
```
/ipfs/Qmaa4Rw81a3a1VEx4LxB7HADUAXvZFhCoRdBzsMZyZmqHD/ipfs.protocol
/http/w3id.org/ipfs/ipfs-1.1.0.json
```
These path names happen to be resolvable -- not just in a "multicodec muxer" (e.g. [multistream](https://github.com/multiformats/multistream)) but -- on the internet as a whole (provided the program (or OS) knows how to use the `/ipfs` and `/http` protocols).
This ["first come, first assign"](https://github.com/multiformats/multicodec/pull/16#issuecomment-260146609) policy is a way to assign codes as they are most needed, without increasing the size of the table (and therefore the size of the multicodecs) too rapidly.
## Implementations
- [go-multicodec](https://github.com/multiformats/go-multicodec)
- [go-multistream](https://github.com/multiformats/go-multistream) - Implements multistream, which uses multicodec for stream negotiation
- [js-multistream](https://github.com/multiformats/js-multistream-select) - Implements multistream, which uses multicodec for stream negotiation
- [clj-multicodec](https://github.com/multiformats/clj-multicodec)
- [go](https://github.com/multiformats/go-multicodec/)
- [JavaScript](https://github.com/multiformats/js-multicodec)
- [Add yours today!](https://github.com/multiformats/multicodec/edit/master/multicodec.md)
## Multicodec Path, also known as [`multistream`](https://github.com/multiformats/multistream)
Multicodec defines a table for the most common data serialization formats, which can be expanded over time or on a per-application basis. However, for two programs to talk to each other, they need to know beforehand which table or table extension is being used.
To enable self-describing data formats or streams that can be described dynamically, without the formal step of adding a binary packed code to a table, we have [`multistream`](https://github.com/multiformats/multistream), so that applications can adopt multiple data formats for their streams and use them to create different protocols.
## FAQ
> **Q. I have questions on multicodec, not listed here.**
That's not a question. But, have you checked the proper [multicodec FAQ](./README.md#faq)? Maybe your question is answered there. This FAQ is only specifically for multicodec-packed.
> **Q. Why?**
Today, people speak many languages, and use common ones to interface. But every "common language" has evolved over time, or even fundamentally switched. Why should we expect programs to be any different?
Because [multistream](https://github.com/multiformats/multistream) is too long for identifiers. We needed something shorter.
And the reality is they're not. Programs use a variety of encodings. Today we like JSON. Yesterday, XML was all the rage. XDR solved everything, but it's kinda retro. Protobuf is still too cool for school. capnp ("cap and proto") is for cerealization hipsters.
> **Q. Why varints?**
The one problem is figuring out what we're speaking. Humans are pretty smart, we pick up all sorts of languages over time. And we can always resort to pointing and grunting (the ascii of humanity).
So that we have no limitation on protocols. Implementation note: you do not need to implement varints until the standard multicodec table has more than 127 entries.
Programs have a harder time. You can't keep piping json into a protobuf decoder and hope they align. So we have to help them out a bit. That's what multicodec is for.
> **Q. What kind of varints?**
> **Q. Why "codec" and not "encoder" and "decoder"?**
A most-significant-bit (MSB) unsigned varint, as defined by [multiformats/unsigned-varint](https://github.com/multiformats/unsigned-varint).
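A sketch of reading such a varint, assuming the usual layout in which each byte carries 7 value bits (least-significant group first) and the most significant bit marks continuation. `decodeVarint` is a hypothetical name, and the 7-byte cap is a choice made here to stay within JavaScript's safe integer range, not a limit taken from the specification.

```javascript
// Decode an unsigned varint starting at `offset`.
// Returns the value and the number of bytes consumed.
function decodeVarint(bytes, offset = 0) {
  let value = 0;
  let scale = 1;
  // 7 bytes * 7 bits = 49 bits, which stays within Number's safe range.
  for (let i = 0; i < 7; i++) {
    if (offset + i >= bytes.length) throw new RangeError('truncated varint');
    const b = bytes[offset + i];
    value += (b & 0x7f) * scale;
    // A clear high bit marks the final byte.
    if ((b & 0x80) === 0) return { value, bytesRead: i + 1 };
    scale *= 0x80;
  }
  throw new RangeError('varint longer than this sketch supports');
}

// 0x12 (sha2-256 in table.csv) fits in one byte.
const single = decodeVarint(Uint8Array.of(0x12));
// 0x0111 (udp in table.csv) takes two bytes: 0x91 0x02.
const double = decodeVarint(Uint8Array.of(0x91, 0x02));
```

Returning `bytesRead` lets the caller slice off the prefix and hand the remaining bytes to the decoder named by `value`.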
Because they're the same thing. Which one of these is the encoder and which the decoder?
> **Q. Don't we have to agree on a table of protocols?**
5555 ----[ THING ]---> 8888
5555 <---[ THING ]---- 8888
> **Q. Full paths are too big for my use case, is there something smaller?**
Yes, check out [multicodec-packed](./multicodec-packed.md). It uses a varint and a table to achieve the same thing.
Yes, but we already have to agree on what protocols themselves are, so this is not so hard. The table even leaves some room for custom protocol paths, or you can use your own tables. The standard table is only for common things.
## Maintainers


@ -1,123 +0,0 @@
# multicodec-packed
[![](https://img.shields.io/badge/made%20by-Protocol%20Labs-blue.svg?style=flat-square)](http://ipn.io)
[![](https://img.shields.io/badge/project-multiformats-blue.svg?style=flat-square)](http://github.com/multiformats/multiformats)
[![](https://img.shields.io/badge/freenode-%23ipfs-blue.svg?style=flat-square)](http://webchat.freenode.net/?channels=%23ipfs)
> compact self-describing codecs. Save space by using predefined multicodec tables.
## Table of Contents
- [Motivation](#motivation)
- [How does it work? - Protocol Description](#how-does-it-work---protocol-description)
- [Multicodec-Packed Protocol Tables](#multicodec-packed-protocol-tables)
- [Standard mcp protocol table](#standard-mcp-protocol-table)
- [Implementations](#implementations)
- [FAQ](#faq)
- [Maintainers](#maintainers)
- [Contribute](#contribute)
- [License](#license)
## Motivation
[Multicodecs](./README.md) are self-describing protocol/encoding streams. Multicodec-packed is a different representation of multicodec, which uses an agreed-upon "protocol table". It is designed for use in short strings, such as keys or identifiers (e.g. [CID](https://github.com/ipld/cid)).
## Protocol Description - How does the protocol work?
`multicodec-packed` is a _self-describing multiformat_: it wraps other formats with a tiny bit of self-description. A multicodec-packed identifier is both a varint and the code identifying the data that follows. This means that the most significant bit of every byte of a multicodec-packed code is reserved to signal continuation.
This way, a chunk of data identified by multicodec will look like this:
```sh
<multicodec-packed-varint><encoded-data>
# To reduce the cognitive load, we sometimes might write the same line as:
<mcp><data>
```
Another useful scenario is using multicodec-packed as part of the keys used to access data. For example:
```
# suppose we have a value and a key to retrieve it
"<key>" -> <value>
# we can use multicodec-packed with the key to know what codec the value is in
"<mcp><key>" -> <value>
```
It is worth noting that multicodec-packed works very well in conjunction with [multihash](https://github.com/multiformats/multihash) and [multiaddr](https://github.com/multiformats/multiaddr), as you can prefix those values with a multicodec-packed to tell what they are.
## Multicodec-Packed Protocol Tables
Multicodec-packed uses "protocol tables" to agree upon the mapping from one multicodec-packed code (a single varint). These tables map an `<mcp-code>` to a full [multicodec protocol path](./README.md#the-protocol-path). These tables can be application specific, though -- like [with](https://github.com/multiformats/multihash) [other](https://github.com/multiformats/multibase) [multiformats](https://github.com/multiformats/multiaddr) -- we will keep a globally agreed upon table with common protocols and formats.
### Standard mcp protocol table
This is the standard multicodec-packed protocol table.
**WARNING: WIP. this table is not ready for wide use.**
TODO:
- [ ] See if IANA has a ready-made table for us to use here. Even just a listing of the most popular formats would be good enough.
```sh
code codec
# Miscellaneous
0x00 raw binary data
# Multiformats
0x40 multicodec
0x41 multihash
0x42 multiaddr
# Serialization formats (cbor, ion, protobuf, etc)
# TODO
# VCS'es formats (git, hg, SVN, etc)
# TODO
# Blockchain block types (bitcoin, ethereum, stellar, etc)
# TODO
```
## Implementations
- [go](https://github.com/multiformats/go-multicodec-packed/)
- [JavaScript](https://github.com/multiformats/js-multicodec-packed)
- [Add yours today!](https://github.com/multiformats/multicodec/edit/master/multicodec-packed.md)
## FAQ
> **Q. I have questions on multicodec, not listed here.**
That's not a question. But, have you checked the proper [multicodec FAQ](./README.md#faq)? Maybe your question is answered there. This FAQ is only specifically for multicodec-packed.
> **Q. Why?**
Because [multicodec](./README.md) is too long for identifiers. We needed something shorter.
> **Q. Why varints?**
So that we have no limitation on protocols. Implementation note: you do not need to implement varints until the standard multicodec table has more than 127 entries.
> **Q. What kind of varints?**
A most-significant-bit (MSB) unsigned varint, as defined by [multiformats/unsigned-varint](https://github.com/multiformats/unsigned-varint).
> **Q. Don't we have to agree on a table of protocols?**
Yes, but we already have to agree on what protocols themselves are, so this is not so hard. The table even leaves some room for custom protocol paths, or you can use your own tables. The standard table is only for common things.
## Maintainers
Captain: [@jbenet](https://github.com/jbenet).
## Contribute
Contributions welcome. Please check out [the issues](https://github.com/multiformats/multicodec/issues).
Check out our [contributing document](https://github.com/multiformats/multiformats/blob/master/contributing.md) for more information on how we work, and about contributing in general. Please be aware that all interactions related to multiformats are subject to the IPFS [Code of Conduct](https://github.com/ipfs/community/blob/master/code-of-conduct.md).
## License
[MIT](LICENSE)

88
table.csv Normal file

@ -0,0 +1,88 @@
codec, description, code
miscellaneous,,
bin, raw binary, 0x55
bases encodings,,
base1, unary, 0x01
base2, binary (0 and 1), 0x55
base8, octal, 0x07
base10, decimal, 0x09
base16, hexadecimal, 0x
base32, rfc4648, 0x
base32hex, rfc4648, 0x
base58flickr, base58 flickr, 0x
base58btc, base58 bitcoin, 0x
base64, rfc4648, 0x
base64url, rfc4648, 0x
serialization formats,,
cbor, CBOR, 0x
bson, Binary JSON, 0x
ubjson, Universal Binary JSON, 0x
protobuf, Protocol Buffers, 0x
capnp, Cap-n-Proto, 0x
flatbuf, FlatBuffers, 0x
rlp, recursive length prefix, 0x60
multiformats,,
multicodec, , 0x30
multihash, , 0x31
multiaddr, , 0x32
multibase, , 0x33
multihashes,,
sha1, , 0x11
sha2-256, , 0x12
sha2-512, , 0x13
sha3-224, , 0x17
sha3-256, , 0x16
sha3-384, , 0x15
sha3-512, , 0x14
shake-128, , 0x18
shake-256, , 0x19
keccak-224, , 0x1A
keccak-256, , 0x1B
keccak-384, , 0x1C
keccak-512, , 0x1D
,, Note: keccak has variable output length. The number specifies the core length
blake2b, , 0x40
blake2s, , 0x41
reserved for apps, appl specific range, 0x4000-0x40f0
multiaddrs,,
ip4, , 0x04
ip6, , 0x29
tcp, , 0x06
udp, , 0x0111
dccp, , 0x21
sctp, , 0x84
udt, , 0x012D
utp, , 0x012E
ipfs, , 0x2A
http, , 0x01E0
https, , 0x01BB
ws, , 0x01DD
onion, , 0x01BC
archiving formats,,
tar, , 0x
zip, , 0x
image formats,,
png, , 0x
jpg, , 0x
video formats,,
mp4, , 0x
mkv, , 0x
IPLD formats,,
dag-pb, MerkleDAG protobuf, 0x70
dag-cbor, MerkleDAG cbor, 0x71
eth-block, Ethereum Block (RLP), 0x90
eth-tx, Ethereum Tx (RLP), 0x91
bitcoin-block, Bitcoin Block, 0xb0
bitcoin-tx, Bitcoin Tx, 0xb1
stellar-block, Stellar Block, 0xd0
stellar-tx, Stellar Tx, 0xd1