# Fibonacci benchmarks

⚠️ Disclaimer: Please don't use parallel Fibonacci in production! Use the fast doubling method with memoization instead.
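For reference, here is a minimal sketch of the fast doubling part (shown without memoization). It is illustrative only and not part of this repository's benchmarks; the function names are invented for the example. It relies on the identities F(2k) = F(k) * (2*F(k+1) - F(k)) and F(2k+1) = F(k)^2 + F(k+1)^2, which give an O(log n) algorithm without spawning a single task:

```c
#include <stdint.h>
#include <stdio.h>

// Fast doubling: computes F(n) and F(n+1) together in O(log n) steps.
// uint64_t overflows past F(93); use a bignum type for larger n.
static void fib_pair(uint64_t n, uint64_t *fn, uint64_t *fn1)
{
  if (n == 0) { *fn = 0; *fn1 = 1; return; }
  uint64_t a, b;
  fib_pair(n / 2, &a, &b);        // a = F(k), b = F(k+1), with k = n/2
  uint64_t c = a * (2*b - a);     // F(2k)
  uint64_t d = a*a + b*b;         // F(2k+1)
  if (n % 2 == 0) { *fn = c; *fn1 = d; }
  else            { *fn = d; *fn1 = c + d; }
}

int main(void)
{
  uint64_t fn, fn1;
  fib_pair(40, &fn, &fn1);
  printf("fib(40) = %llu\n", (unsigned long long) fn);
  return 0;
}
```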

The Fibonacci benchmark has 3 draws:

  1. It's very simple to implement.
  2. It's unbalanced, so efficiency requires work distribution to avoid idle cores.
  3. It's a very effective benchmark of scheduler overhead, because each individual task is trivial and the number of spawned tasks grows exponentially (on the order of 2^n).

Want to know the difference between low and high overhead?

Run the following C code (taken from Oracle's OpenMP examples):

```c
#include <stdio.h>
#include <omp.h>

int fib(int n)
{
  int i, j;
  if (n < 2)
    return n;
  else
  {
    // Spawn fib(n-1) as a child task; `i` is shared so the child can write its result.
    #pragma omp task shared(i) firstprivate(n)
    {
      i = fib(n-1);
    }

    // Compute fib(n-2) in the current task, then wait for the child.
    j = fib(n-2);
    #pragma omp taskwait
    return i + j;
  }
}

int main()
{
  int n = 40;

  // A single thread spawns the root call; the whole team then executes the task tree.
  #pragma omp parallel shared(n)
  {
    #pragma omp single
    printf("fib(%d) = %d\n", n, fib(n));
  }
}
```

First, compile it with Clang and run it:

```
clang -O3 -fopenmp benchmarks/fibonacci/omp_fib.c
time ./a.out
```

It should finish fairly quickly.

Then compile it with GCC and run it:

```
gcc -O3 -fopenmp benchmarks/fibonacci/omp_fib.c
time ./a.out
```

Notice how some cores go idle as time goes on? Don't forget to kill the benchmark, or you'll be there all day.

## What's happening?

GCC's OpenMP implementation uses a single queue for all tasks. That queue is constantly hammered by every thread and becomes a contention point. Furthermore, it seems that either there is no load balancing, or threads get descheduled while waiting on the contended lock.

Clang's implementation, however, uses a work-stealing scheduler with one deque per thread. The only contention happens when a thread runs out of work and has to look for more in other threads' deques. The victim thread is chosen at random, so the potential contention is spread across all threads instead of concentrating on a single shared structure.
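To make the contrast concrete, below is a deliberately simplified, single-threaded sketch of the per-thread-deque idea. This is not Clang's actual runtime: the names (`Deque`, `find_work`, `steal_top`, ...) are invented for illustration and all synchronization is omitted; real schedulers use lock-free Chase-Lev deques and per-thread random number generators.

```c
#include <stdio.h>
#include <stdlib.h>

#define NWORKERS 4
#define CAP 1024

// One deque per worker: the owner pushes/pops at the bottom,
// thieves steal from the top.
typedef struct { int tasks[CAP]; int top, bottom; } Deque;

static Deque deques[NWORKERS];

static void push_bottom(Deque *q, int task) { q->tasks[q->bottom++ % CAP] = task; }

static int pop_bottom(Deque *q, int *t)
{
  if (q->bottom == q->top) return 0;
  *t = q->tasks[--q->bottom % CAP];
  return 1;
}

static int steal_top(Deque *q, int *t)
{
  if (q->bottom == q->top) return 0;
  *t = q->tasks[q->top++ % CAP];
  return 1;
}

// An idle worker tries its own deque first, then a randomly chosen victim,
// so contention is spread over all deques instead of one shared queue.
static int find_work(int self, int *task)
{
  if (pop_bottom(&deques[self], task)) return 1;
  int victim = rand() % NWORKERS;
  return victim != self && steal_top(&deques[victim], task);
}

int main(void)
{
  push_bottom(&deques[2], 42);              // pretend worker 2 spawned a task
  int task;
  for (int tries = 0; tries < 64; tries++)  // worker 0 is idle and goes stealing
    if (find_work(0, &task)) { printf("worker 0 stole task %d\n", task); break; }
  return 0;
}
```

The key property is that a steal only touches one randomly chosen victim's deque, so even under heavy load no single data structure is hammered by every thread the way GCC's shared task queue is.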