Revamp presentation - add PhD Thesis on VM + Forth benchmarks of threading techniques
parent cbbbd4a1c9
commit 7dd55007e0
@@ -3,34 +3,33 @@ Target audience is Nimbus developers.

## Pure interpreter

| Description | Link |
| --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------- |
| Basic overview of computed gotos | https://eli.thegreenplace.net/2012/07/12/computed-goto-for-efficient-dispatch-tables |
| Optimizing direct threaded code by selective inlining (Paper from 1998 which includes JIT introduction with code!)| http://flint.cs.yale.edu/jvmsem/doc/threaded.ps |
| Design of a bytecode interpreter, including Stack vs Register, how to represent values (single type, tagged unions, untagged union, interface/virtual function) | http://gameprogrammingpatterns.com/bytecode.html |
| Writing a fast interpreter: control-flow graph optimization from LuaJIT author | http://lua-users.org/lists/lua-l/2011-02/msg00742.html |
| In-depth dive on how to write an emulator | http://fms.komkon.org/EMUL8/HOWTO.html |
| Review of interpreter dispatch strategies to limit branch mispredictions: direct threaded code vs indirect threaded code vs token threaded code vs switch based dispatching vs replicated switch dispatching + Bibliography | http://realityforge.org/code/virtual-machines/2011/05/19/interpreters.html |
| Fast VMs without assembly - speeding up the interpreter loop: threaded interpreter, duff's device, JIT, Nostradamus distributor | http://www.emulators.com/docs/nx25_nostradamus.htm |
| Switch case vs Table vs Function caching/dynarec | http://ngemu.com/threads/switch-case-vs-function-table.137562/ |
| Jump tables vs Switch | http://www.cipht.net/2017/10/03/are-jump-tables-always-fastest.html |
| Paper: branch prediction and the performance of Interpreters - Don't trust the folklore | https://hal.inria.fr/hal-01100647/document |
| Paper by author of ANTLR: The Structure and Performance of Efficient Interpreters | https://www.jilp.org/vol5/v5paper12.pdf |
| Paper by author of ANTLR introducing dynamic replication: Optimizing Indirect Branch Prediction Accuracy in Virtual Machine Interpreter | https://www.scss.tcd.ie/David.Gregg/papers/toplas05.pdf |
| Benchmarking VM Dispatch strategies in Rust: Switch vs unrolled switch vs tail call dispatch vs Computed Gotos | https://pliniker.github.io/post/dispatchers/ |
| Computed Gotos for fast dispatching in Python | https://github.com/python/cpython/blob/9d6171ded5c56679bc295bacffc718472bcb706b/Python/ceval.c#L571-L608 |
* Threading techniques for Forth (indirect, Direct, Token, Switch, Call, Segment threading) - [link](http://www.complang.tuwien.ac.at/forth/threaded-code.html#call-threading)
* Benchmark of interpreter dispatch techniques for Forth on x86, PPC, MIPS, SPARC, Itanium and ARM - [link](http://www.complang.tuwien.ac.at/forth/threading/)
* PhD Thesis: Virtual machine Showdown: Stack vs Registers, with review of ALL interpreter dispatch techniques - [link](https://www.scss.tcd.ie/publications/tech-reports/reports.07/TCD-CS-2007-49.pdf)
* Basic overview of computed gotos - [link](https://eli.thegreenplace.net/2012/07/12/computed-goto-for-efficient-dispatch-tables)
* Optimizing direct threaded code by selective inlining (Paper from 1998 which includes JIT introduction with code!) - [link](http://flint.cs.yale.edu/jvmsem/doc/threaded.ps)
* Design of a bytecode interpreter, including Stack vs Register, how to represent values (single type, tagged unions, untagged union, interface/virtual function) - [link](http://gameprogrammingpatterns.com/bytecode.html)
* Writing a fast interpreter: control-flow graph optimization from LuaJIT author - [link](http://lua-users.org/lists/lua-l/2011-02/msg00742.html)
* In-depth dive on how to write an emulator - [link](http://fms.komkon.org/EMUL8/HOWTO.html)
* Review of interpreter dispatch strategies to limit branch mispredictions: direct threaded code vs indirect threaded code vs token threaded code vs switch based dispatching vs replicated switch dispatching + Bibliography - [link](http://realityforge.org/code/virtual-machines/2011/05/19/interpreters.html)
* Fast VMs without assembly - speeding up the interpreter loop: threaded interpreter, duff's device, JIT, Nostradamus distributor by the author of Bochs x86 emulator - [link](http://www.emulators.com/docs/nx25_nostradamus.htm)
* Switch case vs Table vs Function caching/dynarec - [link](http://ngemu.com/threads/switch-case-vs-function-table.137562/)
* Jump tables vs Switch - [link](http://www.cipht.net/2017/10/03/are-jump-tables-always-fastest.html)
* Paper: branch prediction and the performance of Interpreters - Don't trust the folklore - [link](https://hal.inria.fr/hal-01100647/document)
* Paper by M. Anton Ertl and David Gregg: The Structure and Performance of Efficient Interpreters - [link](https://www.jilp.org/vol5/v5paper12.pdf)
* Paper by Casey, Ertl and Gregg introducing dynamic replication: Optimizing Indirect Branch Prediction Accuracy in Virtual Machine Interpreters - [link](https://www.scss.tcd.ie/David.Gregg/papers/toplas05.pdf)
* Benchmarking VM Dispatch strategies in Rust: Switch vs unrolled switch vs tail call dispatch vs Computed Gotos - [link](https://pliniker.github.io/post/dispatchers/)
* Computed Gotos for fast dispatching in Python - [link](https://github.com/python/cpython/blob/9d6171ded5c56679bc295bacffc718472bcb706b/Python/ceval.c#L571-L608)

## JIT / Dynamic recompilation

| Description | Link |
| --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------- |
| Optimizing direct threaded code by selective inlining | http://flint.cs.yale.edu/jvmsem/doc/threaded.ps |
| Dynamic recompilation introduction | http://ngemu.com/threads/dynamic-recompilation-an-introduction.20491/ |
| Dynamic recompilation guide with Chip8 | https://github.com/marco9999/Dynarec_Guide/blob/master/Introduction%20to%20Dynamic%20Recompilation%20in%20Emulation.pdf |
| Dynamic recompilation - accompanying source code | https://github.com/marco9999/Super8_jitcore/ |
| Presentation: Interpretation (basic indirect and direct threaded) vs binary translation | http://www.ittc.ku.edu/~kulkarni/teaching/EECS768/slides/chapter2.pdf |
| Threaded interpretation vs Dynarec | http://www.emutalk.net/threads/55275-Threaded-interpretation-vs-Dynamic-Binary-Translation |
| Dynamic recompilation wiki | http://emulation.gametechwiki.com/index.php/Dynamic_recompilation |
* Optimizing direct threaded code by selective inlining - [link](http://flint.cs.yale.edu/jvmsem/doc/threaded.ps)
* Dynamic recompilation introduction - [link](http://ngemu.com/threads/dynamic-recompilation-an-introduction.20491/)
* Dynamic recompilation guide with Chip8 - [link](https://github.com/marco9999/Dynarec_Guide/blob/master/Introduction%20to%20Dynamic%20Recompilation%20in%20Emulation.pdf)
* Dynamic recompilation - accompanying source code - [link](https://github.com/marco9999/Super8_jitcore/)
* Presentation: Interpretation (basic indirect and direct threaded) vs binary translation - [link](http://www.ittc.ku.edu/~kulkarni/teaching/EECS768/slides/chapter2.pdf)
* Threaded interpretation vs Dynarec - [link](http://www.emutalk.net/threads/55275-Threaded-interpretation-vs-Dynamic-Binary-Translation)
* Dynamic recompilation wiki - [link](http://emulation.gametechwiki.com/index.php/Dynamic_recompilation)

## Context Threading

@@ -41,6 +40,7 @@ that makes interpretation nice with the hardware branch predictor. Practical imp
- [Paper](http://www.cs.toronto.edu/~matz/pubs/demkea_context.pdf)
- [Powerpoint](https://webdocs.cs.ualberta.ca/~amaral/cascon/CDP05/slides/CDP05-berndl.pdf)
- [Review / Critic](https://www.complang.tuwien.ac.at/anton/lvas/sem06w/fest.pdf)
- Cited and reviewed in [Virtual Machine Showdown PhD Thesis](https://www.scss.tcd.ie/publications/tech-reports/reports.07/TCD-CS-2007-49.pdf)

Basically, instead of a computed goto, you have a computed "call", and each section called ends with
the ret (return) instruction. Note that the address called is still inline; no parameter is pushed on the stack.
@@ -61,7 +61,7 @@ arbitrary call and ret instructions.
- [Bochs x86 emulator](https://sourceforge.net/projects/bochs/)
- [Virtualization without Execution: Designing a portable VM - Powerpoint](http://bochs.sourceforge.net/VirtNoJit.pdf)
- [Virtualization without Execution - Paper](http://bochs.sourceforge.net/Virtualization_Without_Hardware_Final.pdf)
- Author is also the author of the Nostradamus Distributor linked in pure itnerpreter optimizations
- Author is also the author of the Nostradamus Distributor linked in pure interpreter optimizations
- MorphoVM
- Thesis: [Morpho VM: An Indirect Threaded Stackless
Virtual Machine](https://skemman.is/bitstream/1946/4809/1/hhg-bs.pdf)
@@ -374,9 +374,9 @@ let initial = if arguments.len > 0: parseInt($arguments[0])

main(initial)

## Results on i5-5257U (Broadwell mobile dual core 2.7 turbo 3.1Ghz)
## Results on i5-5257U (Broadwell mobile dual core 2.7 turbo 3.1Ghz)
# Note that since Haswell, Intel CPUs have significantly improved switch prediction
# This probably won't carry to ARM devices
# This probably won't carry to ARM devices

# Warmup: 4.081501s
# result: -14604293096444
@@ -389,4 +389,4 @@ main(initial)
# interp_handlers took 11.039072s for 1000000000 instructions: 90.58732473164413 Mips (M instructions/s)
# result: -14604293096444
# interp_methods took 23.359635s for 1000000000 instructions: 42.80888806695823 Mips (M instructions/s)
```
```