This is an introductory tutorial for OpenMP using GCC. The tutorial runs on Linux and has been checked on Ubuntu. It was created to present OpenMP to final-year students of the Electrical & Computer Engineering Department of the University of Patras, during the course Parallel and Distributed Processing and Applications.
The accompanying code is available under the GPL from Google Code, here.
We need the packages build-essential and libgomp1. The first installs in one step all the basic development tools for Ubuntu, like the GCC compiler, GNU make, etc. The second, libgomp1, is the GCC OpenMP support library. Install them using:
sudo apt-get install build-essential libgomp1
common.mk is supposed to be included in every project. It contains the necessary OpenMP flags to build the programs. This way, a project with a single source file, hello.c, which is supposed to create the executable hello, needs only the following declarations:
TARGET := hello
OBJ := hello.o

include ../common.mk
To build the project, just run:

$ make
There is also a master Makefile. The only thing it does is invoke each program's default or clean target. Thus, to build everything from the top-level directory, run:

$ make
and to delete all binaries and intermediate files, run:
$ make clean
The Hello, OpenMP program is in directory hello.
OpenMP controls parallelism using preprocessor directives, known as pragmas. All OpenMP directives start with #pragma omp.
The preprocessor directive #pragma omp parallel defines a parallel region. The private clause takes program variables as comma-separated arguments and states that one copy of each variable will be created for each thread, and this copy will be private to the thread.
$ make
gcc -fopenmp -c -o hello.o hello.c
gcc -fopenmp hello.o -o hello
Note that my CPU is an Intel i5-750 quad core, so running the program results in four threads:
$ ./hello
Hello from OpenMP thread 0
Hello from OpenMP thread 1
Hello from OpenMP thread 2
Hello from OpenMP thread 3
To control the number of threads externally, use the OMP_NUM_THREADS environment variable. Here is a demonstration with two threads:
$ export OMP_NUM_THREADS=2
$ ./hello
Hello from OpenMP thread 0
Hello from OpenMP thread 1
We can also oversubscribe the cores with more than one thread each. See what happens with 8 threads:
$ export OMP_NUM_THREADS=8
$ ./hello
Hello from OpenMP thread 7
Hello from OpenMP thread 3
Hello from OpenMP thread 4
Hello from OpenMP thread 5
Hello from OpenMP thread 1
Hello from OpenMP thread 6
Hello from OpenMP thread 2
Hello from OpenMP thread 0
The Parallel for program is in directory for. The program creates two vectors, a and b, and initializes them with random data. The size is defined by the macro SIZE. Then, vector c is filled with the sum of vectors a and b. The code offers two alternatives, one parallel using OpenMP and one serial.
Loops can be executed in parallel using the directive #pragma omp parallel for. OpenMP automatically partitions the iterations of the loop and assigns them to different threads. Again, we need to designate shared and private variables. Shared variables are accessible from all threads, so we must be careful about race conditions. When a variable is private, the original value is copied to each thread and each thread has access to its own copy.
Let’s set the size to 100, build the code and see what happens for various thread numbers.
$ export OMP_NUM_THREADS=2
$ ./for
Parallel for
Size 100
Serial run time: 0.002304 msec
Parallel run time: 0.097218 msec
Comparing results: SUCCESS

$ export OMP_NUM_THREADS=4
$ ./for
Parallel for
Size 100
Serial run time: 0.002305 msec
Parallel run time: 0.184798 msec
Comparing results: SUCCESS

$ export OMP_NUM_THREADS=8
$ ./for
Parallel for
Size 100
Serial run time: 0.002375 msec
Parallel run time: 0.340542 msec
Comparing results: SUCCESS
Oops? What happened?
Let’s change the size to 1000000.
$ export OMP_NUM_THREADS=2
$ ./for
Parallel for
Size 1000000
Serial run time: 6.796168 msec
Parallel run time: 3.927334 msec
Comparing results: SUCCESS

$ export OMP_NUM_THREADS=4
$ ./for
Parallel for
Size 1000000
Serial run time: 6.921321 msec
Parallel run time: 2.202346 msec
Comparing results: SUCCESS

$ export OMP_NUM_THREADS=8
$ ./for
Parallel for
Size 1000000
Serial run time: 6.974470 msec
Parallel run time: 2.799692 msec
Comparing results: SUCCESS
Ok, this seems better: we see a speedup of about 3.1X at 4 threads. Remember that my processor is quad core and it's not exactly idle at the time of the "benchmark", running X and all my usual suspects: Eclipse, 40-50 Chrome tabs, 5-10 Firefox tabs, Thunderbird, and so on.
So, what happened? Thread creation, iteration partitioning and scheduling introduce overheads. When the task is very small, like adding 100 numbers, these overheads dominate the running time. Don't forget Amdahl.
The tasks example is similar to parallel for. The difference is that the c vector contains a+b, while the d vector contains a-b. The addition is performed by one thread, while the subtraction is performed by another. It should be noted that this is a bad example for tasks, presented only for demonstration.
Tasks define independent units of work that can be executed concurrently. They must exist within an omp parallel region. A single region defines a step that is executed by only one thread of the team. Tasks can define shared and private variables as in the parallel for example.
Let’s run the example using a variable number of threads.
$ export OMP_NUM_THREADS=1
$ ./tasks
Parallel tasks
Size 1000000
Serial run time: 12.459671 msec
Parallel run time: 14.227681 msec
Comparing results for c: SUCCESS
Comparing results for d: SUCCESS

$ export OMP_NUM_THREADS=2
$ ./tasks
Parallel tasks
Size 1000000
Serial run time: 12.480135 msec
Parallel run time: 7.439402 msec
Comparing results for c: SUCCESS
Comparing results for d: SUCCESS

$ export OMP_NUM_THREADS=4
$ ./tasks
Parallel tasks
Size 1000000
Serial run time: 12.563873 msec
Parallel run time: 7.793634 msec
Comparing results for c: SUCCESS
Comparing results for d: SUCCESS

$ export OMP_NUM_THREADS=8
$ ./tasks
Parallel tasks
Size 1000000
Serial run time: 12.667728 msec
Parallel run time: 7.871645 msec
Comparing results for c: SUCCESS
Comparing results for d: SUCCESS
The expected behavior is observed here. The optimal result is obtained with two threads; the single-thread alternative suffers a small overhead, and the 4- and 8-thread alternatives suffer a negligible overhead.
It should be noted again that tasks introduce more overhead than parallel for loops.
Exercise: Compiler optimization options
Now, let’s enable some GCC optimization flags – what I actually did was to select what seemed relevant from a big list, without profiling or anything.
make CFLAGS+='-O3 -falign-arrays -fexpensive-optimizations -funroll-loops \
    -fmove-loop-invariants -fprefetch-loop-arrays -ftree-loop-optimize \
    -ftree-vect-loop-version -ftree-vectorize'
Let’s repeat the same runs:
$ export OMP_NUM_THREADS=1
$ ./tasks
Parallel tasks
Size 1000000
Serial run time: 3.942910 msec
Parallel run time: 4.579994 msec
Comparing results for c: SUCCESS
Comparing results for d: SUCCESS
$
$ export OMP_NUM_THREADS=2
$ ./tasks
Parallel tasks
Size 1000000
Serial run time: 3.669765 msec
Parallel run time: 2.358651 msec
Comparing results for c: SUCCESS
Comparing results for d: SUCCESS
$
$ export OMP_NUM_THREADS=4
$ ./tasks
Parallel tasks
Size 1000000
Serial run time: 3.545170 msec
Parallel run time: 2.317864 msec
Comparing results for c: SUCCESS
Comparing results for d: SUCCESS
$
$ export OMP_NUM_THREADS=8
$ ./tasks
Parallel tasks
Size 1000000
Serial run time: 3.661454 msec
Parallel run time: 2.407260 msec
Comparing results for c: SUCCESS
Comparing results for d: SUCCESS
This shows the impressive gains from modern compiler optimization. With all these flags enabled, the speedup from OpenMP remains about 1.5X with two threads.
Issues not covered here
This section briefly presents introductory topics that were not covered in this tutorial.
#pragma omp critical
Defines a critical region, a block of code that only one thread may execute at a time.
#pragma omp barrier
Defines a barrier, a point in code that needs to be reached by all active threads before continuing further.
#pragma omp atomic statement
Defines an atomic operation. The statement it applies to must have the form x++, ++x, x--, --x, or x binop= E, where binop is one of +, *, -, /, &, ^, |, <<, >>.
Another OpenMP construct that provides synchronization is the locking mechanism (omp_init_lock, omp_set_lock, omp_unset_lock, omp_destroy_lock).
#pragma omp parallel for reduction (op : list)
Unlike the above constructs, this is not purely a synchronization construct, but it involves synchronization. It performs the reduction operation `op` over the `list` of reduction variables. Reduction and prefix operations are very interesting parallel algorithmic techniques.
One more thing …
False sharing is a significant problem in multi-core programming. False sharing occurs when a core needs to update a variable that resides in the cache of another core, not necessarily because the other core uses the same variable, but because it has fetched a whole region containing it, called a cache line.
Before we proceed, let’s build a basic understanding of multicore cache management. Multicore processors usually feature several levels of cache, with per-core caches and shared caches. On a cache miss, the cache fetches not only the requested memory location but a bigger block around it, called a cache line. At this point, this cache – and thus the corresponding core – becomes the owner of the cache line. When another core’s cache later misses on this memory location, a cache-coherency mechanism is responsible for keeping the system in a consistent state. The exact protocol is architecture dependent, but in general memory must be updated from the previous owner’s cache contents, the previous owner’s line invalidated, and the cache line brought into the new owner’s cache. The update to memory can be avoided if the cache line is not marked dirty – that is, modified after it was fetched. The idea that the first core to access a memory location becomes the owner of this location in the cache is referred to as first touch.
In the false-sharing directory you can see a demonstration of false sharing. Consider a situation where concurrent threads perform a computation and each one updates one element of an array. Each time a thread accesses its own element of the array, the whole cache line has to change owner from core to core. What is demonstrated, then, is a constant ping-pong of cache-line ownership.
false-sharing computes the sum of a vector of random numbers. The first part computes it sequentially, in order to validate the result. The second part splits the original vector elements among the available threads, and each one computes a partial sum. The partial sums are written into an array. To remove the false-sharing problem, we define thread-private variables instead of the partial-sums array. At the end, using a critical section, we add each partial sum to the global sum. The same effect could be achieved with an atomic operation.
$ export OMP_NUM_THREADS=4
$ ./false-sharing
Parallel sum
Size 10000000
Serial run time: 25.720492 msec
Parallel run time, false sharing: 44.963422 msec
Comparing results: SUCCESS
Parallel run time, no false sharing: 13.556660 msec
Comparing results: SUCCESS
For more tutorials, see the following online content.