use LTO in release builds (#1661)

* use LTO in release builds

This significantly (40%) speeds up block replay and hashing - for example replaying first 1000
blocks, without/with LTO:

```
[arnetheduck@tempus ncli]$ ../env.sh nim c -d:release ncli_db
[arnetheduck@tempus ncli]$ ./ncli_db bench --db:db --network:medalla --slots:1000
Loaded 215006 blocks, head slot 307400
All time are ms
     Average,       StdDev,          Min,          Max,      Samples,         Test
Validation is turned off meaning that no BLS operations are performed
   25468.481,        0.000,    25468.481,    25468.481,            1, Initialize DB
       0.297,        0.516,        0.053,       13.645,          721, Load block from database
      26.458,        0.000,       26.458,       26.458,            1, Load state from database
      20.737,        8.288,       11.096,      199.325,          690, Apply block
     333.069,       62.798,       45.225,      429.452,           31, Apply epoch block
       0.000,        0.000,        0.000,        0.000,            0, Database block store
```

```
[arnetheduck@tempus ncli]$ ../env.sh nim c -d:release --passc:-flto --passl:-flto --stacktrace:off ncli_db
[arnetheduck@tempus ncli]$ ./ncli_db bench --db:db --network:medalla --slots:1000
Loaded 215006 blocks, head slot 307400
All time are ms
     Average,       StdDev,          Min,          Max,      Samples,         Test
Validation is turned off meaning that no BLS operations are performed
   23903.006,        0.000,    23903.006,    23903.006,            1, Initialize DB
       0.253,        0.122,        0.047,        0.731,          721, Load block from database
      24.455,        0.000,       24.455,       24.455,            1, Load state from database
      18.734,        7.062,       10.346,      167.397,          690, Apply block
     194.869,       33.175,       29.311,      226.981,           31, Apply epoch block
```

Epoch processing is heavy on both arithmetics and hash caching, both of which get a
significant boost here.

This makes sense: nim creates lots of small functions spread out over many C files. A much
worse solution is to try to annotate code with `inline` - it copies functions to multiple
C files but still doesn't do intermodule optimizations significantly limiting the
compilers' ability to reason about the code, causing bloat and misrepresenting the usefulness
of a function to the call frequency analysis that drives actual (C-compiler) inlining and many
other optimizations.

In particular, many nim functions are part of `system` or the `C` backend - stack tracing,
memory allocation etc - nim's inlining system is pretty incomplete in that it does not deal
with these and many other cases.

* windows workaround

* skip LTO on windows for now
This commit is contained in:
Jacek Sieka 2020-09-24 18:40:28 +02:00 committed by GitHub
parent 0eb53f2802
commit 1a9e9fc398
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23

View File

@ -3,6 +3,20 @@ if defined(release):
else: else:
switch("nimcache", "nimcache/debug/$projectName") switch("nimcache", "nimcache/debug/$projectName")
# `-flto` gives a significant improvement in processing speed, specially hash tree and state transition (basically any CPU-bound code implemented in nim)
# With LTO enabled, optimization flags should be passed to both compiler and linker!
if defined(release):
if defined(macosx): # Clang
switch("passC", "-flto=thin")
switch("passL", "-flto=thin")
elif defined(linux):
switch("passC", "-flto=auto")
switch("passL", "-flto=auto")
else:
# On windows, LTO needs more love and attention so the right linkers
# are used
discard
if defined(windows): if defined(windows):
# disable timestamps in Windows PE headers - https://wiki.debian.org/ReproducibleBuilds/TimestampsInPEBinaries # disable timestamps in Windows PE headers - https://wiki.debian.org/ReproducibleBuilds/TimestampsInPEBinaries
switch("passL", "-Wl,--no-insert-timestamp") switch("passL", "-Wl,--no-insert-timestamp")
@ -25,12 +39,15 @@ if defined(windows):
# use at least -msse2 or -msse3. # use at least -msse2 or -msse3.
if defined(disableMarchNative): if defined(disableMarchNative):
switch("passC", "-msse3") switch("passC", "-msse3")
switch("passL", "-msse3")
else: else:
switch("passC", "-march=native") switch("passC", "-march=native")
switch("passL", "-march=native")
if defined(windows): if defined(windows):
# https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65782 # https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65782
# ("-fno-asynchronous-unwind-tables" breaks Nim's exception raising, sometimes) # ("-fno-asynchronous-unwind-tables" breaks Nim's exception raising, sometimes)
switch("passC", "-mno-avx512f") switch("passC", "-mno-avx512f")
switch("passL", "-mno-avx512f")
--threads:on --threads:on
--opt:speed --opt:speed