Revamp presentation - add PhD Thesis on VM + Forth benchmarks of threading techniques

2018-06-13 12:20:26 +02:00 · 2018-06-13 12:20:26 +02:00 · 7dd55007e0
parent cbbbd4a1c9
commit 7dd55007e0
1 changed file with 29 additions and 29 deletions
--- a/Interpreter-optimization-resources.md
+++ b/Interpreter-optimization-resources.md
@@ -3,34 +3,33 @@ Target audience is Nimbus developers.

 ## Pure interpreter

-| Description                                                                                                                                                                                                                 | Link                                                                                                                    |
-| --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------- |
-| Basic overview of computed gotos | https://eli.thegreenplace.net/2012/07/12/computed-goto-for-efficient-dispatch-tables |
-| Optimizing direct threaded code by selective inlining (Paper from 1998 which includes JIT introduction with code!)| http://flint.cs.yale.edu/jvmsem/doc/threaded.ps |
-| Design of a bytecode interpreter, including Stack vs Register, how to represent values (single type, tagged unions, untagged union, interface/virtual function)                                                             | http://gameprogrammingpatterns.com/bytecode.html                                                                        |
-| Writing a fast interpreter: control-flow graph optimization from LuaJIT author                                                                                                                                              | http://lua-users.org/lists/lua-l/2011-02/msg00742.html                                                                  |
-| In-depth dive on how to write an emulator                                                                                                                                                                                   | http://fms.komkon.org/EMUL8/HOWTO.html                                                                                  |
-| Review of interpreter dispatch strategies to limit branch mispredictions: direct threaded code vs indirect threaded code vs token threaded code vs switch based dispatching vs replicated switch dispatching + Bibliography | http://realityforge.org/code/virtual-machines/2011/05/19/interpreters.html                                              |
-| Fast VMs without assembly - speeding up the interpreter loop: threaded interpreter, duff's device, JIT, Nostradamus distributor                                                                                             | http://www.emulators.com/docs/nx25_nostradamus.htm                                                                      |
-| Switch case vs Table vs Function caching/dynarec                                                                                                                                                                            | http://ngemu.com/threads/switch-case-vs-function-table.137562/                                                          |
-| Jump tables vs Switch                                                                                                                                                                                                       | http://www.cipht.net/2017/10/03/are-jump-tables-always-fastest.html                                                     |
-| Paper: branch prediction and the performance of Interpreters - Don't trust the folklore                                                                                                                                     | https://hal.inria.fr/hal-01100647/document                                                                              |
-| Paper by author of ANTLR: The Structure and Performance of Efficient Interpreters                                                                                                                                           | https://www.jilp.org/vol5/v5paper12.pdf                                                                                 |
-| Paper by author of ANTLR introducing dynamic replication: Optimizing Indirect Branch Prediction Accuracy in Virtual Machine Interpreter                                                                                     | https://www.scss.tcd.ie/David.Gregg/papers/toplas05.pdf                                                                 |
-| Benchmarking VM Dispatch strategies in Rust: Switch vs unrolled switch vs tail call dispatch vs Computed Gotos                                                                                                              | https://pliniker.github.io/post/dispatchers/                                                                            |
-| Computed Gotos for fast dispatching in Python                                                                                                                                                                               | https://github.com/python/cpython/blob/9d6171ded5c56679bc295bacffc718472bcb706b/Python/ceval.c#L571-L608                |
+* Threading techniques for Forth (indirect, Direct, Token, Switch, Call, Segment threading)                                                                                                                                   - [link](http://www.complang.tuwien.ac.at/forth/threaded-code.html#call-threading)
+* Benchmark of interpreter dispatch techniques for Forth on x86, PPC, MIPS, SPARC, Itanium and ARM                                                                                                                            - [link](http://www.complang.tuwien.ac.at/forth/threading/)
+* PhD Thesis: Virtual machine Showdown: Stack vs Registers, with review of ALL interpreter dispatch techniques                                                                                                                - [link](https://www.scss.tcd.ie/publications/tech-reports/reports.07/TCD-CS-2007-49.pdf)
+* Basic overview of computed gotos                                                                                                                                                                                            - [link](https://eli.thegreenplace.net/2012/07/12/computed-goto-for-efficient-dispatch-tables)
+* Optimizing direct threaded code by selective inlining (Paper from 1998 which includes JIT introduction with code!)                                                                                                          - [link](http://flint.cs.yale.edu/jvmsem/doc/threaded.ps)
+* Design of a bytecode interpreter, including Stack vs Register, how to represent values (single type, tagged unions, untagged union, interface/virtual function)                                                             - [link](http://gameprogrammingpatterns.com/bytecode.html)
+* Writing a fast interpreter: control-flow graph optimization from LuaJIT author                                                                                                                                              - [link](http://lua-users.org/lists/lua-l/2011-02/msg00742.html)
+* In-depth dive on how to write an emulator                                                                                                                                                                                   - [link](http://fms.komkon.org/EMUL8/HOWTO.html)
+* Review of interpreter dispatch strategies to limit branch mispredictions: direct threaded code vs indirect threaded code vs token threaded code vs switch based dispatching vs replicated switch dispatching + Bibliography - [link](http://realityforge.org/code/virtual-machines/2011/05/19/interpreters.html)
+* Fast VMs without assembly - speeding up the interpreter loop: threaded interpreter, duff's device, JIT, Nostradamus distributor by the author of Bochs x86 emulator                                                         - [link](http://www.emulators.com/docs/nx25_nostradamus.htm)
+* Switch case vs Table vs Function caching/dynarec                                                                                                                                                                            - [link](http://ngemu.com/threads/switch-case-vs-function-table.137562/)
+* Jump tables vs Switch                                                                                                                                                                                                       - [link](http://www.cipht.net/2017/10/03/are-jump-tables-always-fastest.html)
+* Paper: branch prediction and the performance of Interpreters - Don't trust the folklore                                                                                                                                     - [link](https://hal.inria.fr/hal-01100647/document)
+* Paper by author of ANTLR: The Structure and Performance of Efficient Interpreters                                                                                                                                           - [link](https://www.jilp.org/vol5/v5paper12.pdf)
+* Paper by author of ANTLR introducing dynamic replication: Optimizing Indirect Branch Prediction Accuracy in Virtual Machine Interpreter                                                                                     - [link](https://www.scss.tcd.ie/David.Gregg/papers/toplas05.pdf)
+* Benchmarking VM Dispatch strategies in Rust: Switch vs unrolled switch vs tail call dispatch vs Computed Gotos                                                                                                              - [link](https://pliniker.github.io/post/dispatchers/)
+* Computed Gotos for fast dispatching in Python                                                                                                                                                                               - [link](https://github.com/python/cpython/blob/9d6171ded5c56679bc295bacffc718472bcb706b/Python/ceval.c#L571-L608)

 ## JIT / Dynamic recompilation

-| Description                                                                                                                                                                                                                 | Link                                                                                                                    |
-| --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------- |
-| Optimizing direct threaded code by selective inlining | http://flint.cs.yale.edu/jvmsem/doc/threaded.ps |
-| Dynamic recompilation introduction                                                                                                                                                                                          | http://ngemu.com/threads/dynamic-recompilation-an-introduction.20491/                                                   |
-| Dynamic recompilation guide with Chip8                                                                                                                                                                                      | https://github.com/marco9999/Dynarec_Guide/blob/master/Introduction%20to%20Dynamic%20Recompilation%20in%20Emulation.pdf |
-| Dynamic recompilation - accompanying source code                                                                                                                                                                            | https://github.com/marco9999/Super8_jitcore/                                                                            |
-| Presentation: Interpretation (basic indirect and direct threaded) vs binary translation                                                                                                                                     | http://www.ittc.ku.edu/~kulkarni/teaching/EECS768/slides/chapter2.pdf                                                   |
-| Threaded interpretation vs Dynarec                                                                                                                                                                                          | http://www.emutalk.net/threads/55275-Threaded-interpretation-vs-Dynamic-Binary-Translation                              |
-| Dynamic recompilation wiki                                                                                                                                                                                                  | http://emulation.gametechwiki.com/index.php/Dynamic_recompilation                                                       |
+* Optimizing direct threaded code by selective inlining                                   - [link](http://flint.cs.yale.edu/jvmsem/doc/threaded.ps)
+* Dynamic recompilation introduction                                                      - [link](http://ngemu.com/threads/dynamic-recompilation-an-introduction.20491/)
+* Dynamic recompilation guide with Chip8                                                  - [link](https://github.com/marco9999/Dynarec_Guide/blob/master/Introduction%20to%20Dynamic%20Recompilation%20in%20Emulation.pdf)
+* Dynamic recompilation - accompanying source code                                        - [link](https://github.com/marco9999/Super8_jitcore/)
+* Presentation: Interpretation (basic indirect and direct threaded) vs binary translation - [link](http://www.ittc.ku.edu/~kulkarni/teaching/EECS768/slides/chapter2.pdf)
+* Threaded interpretation vs Dynarec                                                      - [link](http://www.emutalk.net/threads/55275-Threaded-interpretation-vs-Dynamic-Binary-Translation)
+* Dynamic recompilation wiki                                                              - [link](http://emulation.gametechwiki.com/index.php/Dynamic_recompilation)

 ## Context Threading

@@ -41,6 +40,7 @@ that makes interpretation nice with the hardware branch predictor. Practical imp
  - [Paper](http://www.cs.toronto.edu/~matz/pubs/demkea_context.pdf)
  - [Powerpoint](https://webdocs.cs.ualberta.ca/~amaral/cascon/CDP05/slides/CDP05-berndl.pdf)
  - [Review / Critic](https://www.complang.tuwien.ac.at/anton/lvas/sem06w/fest.pdf)
+  - Cited and reviewed in [Virtual Machine Showdown PhD Thesis](https://www.scss.tcd.ie/publications/tech-reports/reports.07/TCD-CS-2007-49.pdf)

 Basically, instead of computed goto, you have computed "call" and each section called is ended by
 the ret (return) instruction. Note that the address called is still inline; no parameter is pushed on the stack.
@@ -61,7 +61,7 @@ arbitrary call and ret instructions.
 - [Bochs x86 emulator](https://sourceforge.net/projects/bochs/)
  - [Virtualization without Execution: Designing a portable VM - Powerpoint](http://bochs.sourceforge.net/VirtNoJit.pdf)
  - [Virtualization without Execution - Paper](http://bochs.sourceforge.net/Virtualization_Without_Hardware_Final.pdf)
-  - Author is also the author of the Nostradamus Distributor linked in pure itnerpreter optimizations
+  - Author is also the author of the Nostradamus Distributor linked in pure interpreter optimizations
 - MorphoVM
  - Thesis: [Morpho VM: An Indirect Threaded Stackless
 Virtual Machine](https://skemman.is/bitstream/1946/4809/1/hhg-bs.pdf)
@@ -374,9 +374,9 @@ let initial = if arguments.len > 0: parseInt($arguments[0])

 main(initial)

-## Results on i5-5257U (Broadwell mobile dual core 2.7 turbo 3.1Ghz) 
+## Results on i5-5257U (Broadwell mobile dual core 2.7 turbo 3.1Ghz)
 # Note that since Haswell, Intel CPUs are significantly improved at switch prediction
-# This probably won't carry to ARM devices 
+# This probably won't carry to ARM devices

 # Warmup: 4.081501s
 # result: -14604293096444
@@ -389,4 +389,4 @@ main(initial)
 # interp_handlers took 11.039072s for 1000000000 instructions: 90.58732473164413 Mips (M instructions/s)
 # result: -14604293096444
 # interp_methods took 23.359635s for 1000000000 instructions: 42.80888806695823 Mips (M instructions/s)
-```
+```