
Fibonacci benchmarks

⚠️ Disclaimer: Please don't use parallel fibonacci in production! Use the fast doubling method with memoization instead.

The Fibonacci benchmark has three draws:

  1. It's very simple to implement.
  2. It's unbalanced, so efficiency requires load distribution to avoid idle cores.
  3. It's a very effective scheduler overhead benchmark, because each task is trivial and the number of spawned tasks grows at 2^n scale.

Want to know the difference between low and high overhead?

Run the following C code (taken from an Oracle OpenMP example):

#include <stdio.h>
#include <omp.h>

int fib(int n)
{
  int i, j;
  if (n < 2)
    return n;
  else
    {
      /* Spawn fib(n-1) as a task; i is shared so the parent can read the result */
      #pragma omp task shared(i) firstprivate(n)
      {
        i = fib(n-1);
      }

      j = fib(n-2);
      #pragma omp taskwait  /* wait for the spawned task before summing */
      return i + j;
    }
}

int main()
{
  int n = 40;

  #pragma omp parallel shared(n)
  {
    #pragma omp single  /* a single thread seeds the task tree */
    printf("fib(%d) = %d\n", n, fib(n));
  }
}

First compile it with Clang and run it:

clang -O3 -fopenmp benchmarks/fibonacci/omp_fib.c
time ./a.out

It should finish fairly quickly.

Then compile it with GCC and run it:

gcc -O3 -fopenmp benchmarks/fibonacci/omp_fib.c
time ./a.out

Notice how some cores go idle as time goes on? Don't forget to kill the benchmark, or you'll be there all day.

What's happening?

GCC's OpenMP implementation (libgomp) uses a single queue for all tasks. That queue is constantly hammered by all threads and becomes a contention point. Furthermore, it seems there is either no load balancing, or the contention on the lock causes threads to be descheduled.

Clang's implementation, however, uses a work-stealing scheduler with one queue per thread. The only contention happens when a thread runs out of work and has to look for more in the queues of other threads. The victim thread is chosen at random, so the potential contention is distributed among all threads instead of concentrated on a single structure.