OpenMP tutorial

This is an introductory tutorial for OpenMP using GCC. The tutorial runs on Linux and has been checked on Ubuntu. It was created to present OpenMP to final-year students of the Electrical & Computer Engineering Department of the University of Patras, during the course Parallel and Distributed Processing and Applications.
The accompanying code is available under the GPL on Google Code.

System pre-requisites

We need the packages build-essential and libgomp1. The first installs in one step all the basic development tools for Ubuntu, like the GCC compiler, GNU make, etc. libgomp1 is the GCC OpenMP support library. Install them using:

sudo apt-get install build-essential libgomp1

Building

The file common.mk is meant to be included in every project. It contains the OpenMP flags needed to build the programs. This way, a project with a single source file, hello.c, that builds the executable hello needs only the following declarations:

TARGET := hello
OBJ := hello.o
include ../common.mk

To build the project, just run make.
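
A common.mk along these lines could look like the following sketch; this is an assumption based on the build output shown later, not necessarily the exact file from the repository (recipe lines must start with a tab):

CC      := gcc
CFLAGS  += -fopenmp
LDFLAGS += -fopenmp

# Compilation of .c into .o uses make's built-in rule, which picks up CFLAGS.
default: $(TARGET)

$(TARGET): $(OBJ)
	$(CC) $(LDFLAGS) $(OBJ) -o $@

clean:
	rm -f $(TARGET) $(OBJ)
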
There is also a master Makefile. The only thing it does is call each program's default or clean target. Thus, to build everything from the top-level directory, run:

$ make

and to delete all binaries and intermediate files, run:

$ make clean
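
The top-level Makefile could then be as simple as this sketch (the subdirectory list is illustrative):

SUBDIRS := hello for tasks false-sharing

# Run the requested target ($@) in every example directory.
default clean:
	for d in $(SUBDIRS); do $(MAKE) -C $$d $@; done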

Hello, OpenMP

The Hello, OpenMP program is in directory hello.

OpenMP controls parallelism using pre-processing directives, known as pragmas. In C, all OpenMP directives start with #pragma omp.

The pre-processor directive #pragma omp parallel defines a parallel region. The private clause takes program variables as comma-separated arguments and states that a separate copy of each variable will be created for every thread, and that copy is private to the thread.
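
A minimal hello.c along these lines could look like the following sketch; the file in the repository may differ in details:

#include <stdio.h>
#include <omp.h>

int main(void)
{
	int tid;

	/* Each thread gets its own private copy of tid. */
	#pragma omp parallel private(tid)
	{
		tid = omp_get_thread_num();
		printf("Hello from OpenMP thread %d\n", tid);
	}

	return 0;
}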

Build it:

$ make
gcc -fopenmp   -c -o hello.o hello.c
gcc -fopenmp  hello.o   -o hello

Note that my CPU is an Intel i5-750 quad core, so running the program results in four threads:

$ ./hello
Hello from OpenMP thread 0
Hello from OpenMP thread 1
Hello from OpenMP thread 2
Hello from OpenMP thread 3

To control the number of threads externally, use the OMP_NUM_THREADS environment variable. Here is a demonstration with two threads.

$ export OMP_NUM_THREADS=2
$ ./hello
Hello from OpenMP thread 0
Hello from OpenMP thread 1

We can also overload the cores with more than one thread each. See what happens with 8 threads.

$ export OMP_NUM_THREADS=8
$ ./hello
Hello from OpenMP thread 7
Hello from OpenMP thread 3
Hello from OpenMP thread 4
Hello from OpenMP thread 5
Hello from OpenMP thread 1
Hello from OpenMP thread 6
Hello from OpenMP thread 2
Hello from OpenMP thread 0
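
Besides the environment variable, the thread count can also be set from within the program using the standard OpenMP API, either with omp_set_num_threads() or with a num_threads clause on the parallel directive. A minimal sketch:

#include <stdio.h>
#include <omp.h>

int main(void)
{
	/* Overrides OMP_NUM_THREADS for the parallel regions that follow. */
	omp_set_num_threads(2);

	#pragma omp parallel
	printf("Hello from OpenMP thread %d\n", omp_get_thread_num());

	return 0;
}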

Parallel for

The Parallel for program is in directory for.

The program creates two vectors, a and b, and initializes them with random data. The size is defined by the macro SIZE. Then, vectors c and cv are filled with the sum of vectors a and b. The code offers two alternatives, one parallel using OpenMP and one serial.

Loops can be executed in parallel using the directive #pragma omp parallel for. OpenMP automatically partitions the iterations of the loop and assigns them to different threads. Again, we need to declare shared and private variables. Shared variables are accessible from all threads, so we must be careful about race conditions. When a variable is private, each thread gets its own copy of the variable and works only on that copy.
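
The core of the program could be organized like the following sketch, timed with omp_get_wtime(); the variable names are illustrative and the repository code may differ:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define SIZE 1000000

static double a[SIZE], b[SIZE], c[SIZE], cv[SIZE];

int main(void)
{
	int i;
	double start, end;

	for (i = 0; i < SIZE; i++) {
		a[i] = rand();
		b[i] = rand();
	}

	/* Serial version, used to validate the parallel result. */
	start = omp_get_wtime();
	for (i = 0; i < SIZE; i++)
		cv[i] = a[i] + b[i];
	end = omp_get_wtime();
	printf("Serial run time: %f msec\n", (end - start) * 1000.0);

	/* Parallel version: the iterations are split among the threads.
	 * a, b and c are shared; the loop index i is private. */
	start = omp_get_wtime();
	#pragma omp parallel for shared(a, b, c) private(i)
	for (i = 0; i < SIZE; i++)
		c[i] = a[i] + b[i];
	end = omp_get_wtime();
	printf("Parallel run time: %f msec\n", (end - start) * 1000.0);

	for (i = 0; i < SIZE; i++)
		if (c[i] != cv[i]) {
			printf("Comparing results: FAILURE\n");
			return 1;
		}
	printf("Comparing results: SUCCESS\n");

	return 0;
}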

Let’s set the size to 100, build the code and see what happens for various thread counts.

$ export OMP_NUM_THREADS=2
$ ./for

Parallel for

Size 100
Serial run time: 0.002304 msec
Parallel run time: 0.097218 msec
Comparing results: SUCCESS
$ export OMP_NUM_THREADS=4
$ ./for

Parallel for

Size 100
Serial run time: 0.002305 msec
Parallel run time: 0.184798 msec
Comparing results: SUCCESS
$ export OMP_NUM_THREADS=8
$ ./for

Parallel for

Size 100
Serial run time: 0.002375 msec
Parallel run time: 0.340542 msec
Comparing results: SUCCESS

Oops? What happened?

Let’s change the size to 1000000.

$ export OMP_NUM_THREADS=2
$ ./for

Parallel for

Size 1000000
Serial run time: 6.796168 msec
Parallel run time: 3.927334 msec
Comparing results: SUCCESS
$ export OMP_NUM_THREADS=4
$ ./for

Parallel for

Size 1000000
Serial run time: 6.921321 msec
Parallel run time: 2.202346 msec
Comparing results: SUCCESS
$ export OMP_NUM_THREADS=8
$ ./for

Parallel for

Size 1000000
Serial run time: 6.974470 msec
Parallel run time: 2.799692 msec
Comparing results: SUCCESS

OK, this looks better: we see a speedup of about 3.1X at 4 threads. Remember that my processor is a quad core and it was not exactly idle at the time of the “benchmark”, running X and all my usual suspects: Eclipse, 40-50 Chrome tabs, 5-10 Firefox tabs, Thunderbird, and so on.

So, what happened? Thread creation, iteration partitioning and scheduling all introduce overhead. When the task is very small, like adding 100 numbers, this overhead dominates the running time. Don’t forget Amdahl.
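
Amdahl’s law in one line: if a fraction p of the work runs in parallel on n threads, the speedup is at most 1 / ((1 - p) + p/n), so for a tiny workload the serial part and the fixed overhead dominate.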

Tasks

The tasks example is similar to the parallel for one. The difference is that the c vector contains a+b, while the d vector contains a-b. The addition is performed by one thread and the subtraction by another. It should be noted that this is a poor use case for tasks, presented only for demonstration.

Tasks define independent units of work that the threads of the team execute. They must appear within an omp parallel region. A single region defines a step that is executed by only one thread; here it ensures that only one thread creates the tasks. Tasks can declare shared and private variables as in the parallel for example.
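
A sketch of how the parallel part could be structured is shown below; the layout is illustrative and the repository code may differ:

#include <stdio.h>
#include <stdlib.h>

#define SIZE 1000000

static double a[SIZE], b[SIZE], c[SIZE], d[SIZE];

int main(void)
{
	int i;

	for (i = 0; i < SIZE; i++) {
		a[i] = rand();
		b[i] = rand();
	}

	#pragma omp parallel shared(a, b, c, d)
	{
		/* Only one thread executes the single block; it creates the two
		 * tasks, which any threads of the team can then pick up. */
		#pragma omp single
		{
			#pragma omp task shared(a, b, c)
			{
				int j;
				for (j = 0; j < SIZE; j++)
					c[j] = a[j] + b[j];
			}

			#pragma omp task shared(a, b, d)
			{
				int j;
				for (j = 0; j < SIZE; j++)
					d[j] = a[j] - b[j];
			}
		}
	}

	printf("c[0]=%f d[0]=%f\n", c[0], d[0]);
	return 0;
}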

Let’s run the example using a variable number of threads.

Threads: 1

$ export OMP_NUM_THREADS=1
$ ./tasks

Parallel tasks

Size 1000000
Serial run time: 12.459671 msec
Parallel run time: 14.227681 msec
Comparing results for c: SUCCESS
Comparing results for d: SUCCESS

Threads: 2

$ export OMP_NUM_THREADS=2
$ ./tasks

Parallel tasks

Size 1000000
Serial run time: 12.480135 msec
Parallel run time: 7.439402 msec
Comparing results for c: SUCCESS
Comparing results for d: SUCCESS

Threads: 4

$ export OMP_NUM_THREADS=4
$ ./tasks

Parallel tasks

Size 1000000
Serial run time: 12.563873 msec
Parallel run time: 7.793634 msec
Comparing results for c: SUCCESS
Comparing results for d: SUCCESS

Threads: 8

$ export OMP_NUM_THREADS=8
$ ./tasks

Parallel tasks

Size 1000000
Serial run time: 12.667728 msec
Parallel run time: 7.871645 msec
Comparing results for c: SUCCESS
Comparing results for d: SUCCESS

The expected behavior is observed here. The best result comes from two threads; the single-thread run suffers some overhead, and the 4- and 8-thread runs add only a negligible extra overhead.

It should be noted again that tasks introduce more overhead than parallel for loops.

Exercise: Compiler optimization options

Now, let’s enable some GCC optimization flags. What I actually did was select what seemed relevant from a big list, without profiling or anything.

Compile using:

make CFLAGS+='-O3 -falign-arrays -fexpensive-optimizations -funroll-loops \
	-fmove-loop-invariants -fprefetch-loop-arrays -ftree-loop-optimize \
	-ftree-vect-loop-version -ftree-vectorize'

Let’s repeat the same runs:

$ export OMP_NUM_THREADS=1
$ ./tasks

Parallel tasks

Size 1000000
Serial run time: 3.942910 msec
Parallel run time: 4.579994 msec
Comparing results for c: SUCCESS
Comparing results for d: SUCCESS
$
$ export OMP_NUM_THREADS=2
$ ./tasks

Parallel tasks

Size 1000000
Serial run time: 3.669765 msec
Parallel run time: 2.358651 msec
Comparing results for c: SUCCESS
Comparing results for d: SUCCESS
$
$ export OMP_NUM_THREADS=4
$ ./tasks

Parallel tasks

Size 1000000
Serial run time: 3.545170 msec
Parallel run time: 2.317864 msec
Comparing results for c: SUCCESS
Comparing results for d: SUCCESS
$
$ export OMP_NUM_THREADS=8
$ ./tasks
Parallel tasks

Size 1000000
Serial run time: 3.661454 msec
Parallel run time: 2.407260 msec
Comparing results for c: SUCCESS
Comparing results for d: SUCCESS

This shows the impressive gains from modern compiler optimizations. Even with all of them enabled, the OpenMP speedup remains about 1.5X with two threads.

Issues not covered here

This section lists introductory topics that were not covered in this tutorial.

#pragma omp critical

Defines a critical region, a block that only one thread may execute at a time.

#pragma omp barrier

Defines a barrier, a point in the code that must be reached by all active threads before any of them continues further.

#pragma omp atomic
     statement

Defines an atomic operation. The statement it applies to must have the form x++, ++x, x--, --x, or x binop= expr, where binop is one of +, -, *, /, &, ^, |, <<, >>.

Another OpenMP construct that provides synchronization is the locking mechanism.

#pragma omp parallel for reduction (op : list)

Unlike the above constructs, this is not purely a synchronization construct, but it does involve synchronization. It applies the reduction operation `op` to the variables in `list`. Reduction and prefix operations are very interesting parallel algorithmic techniques.
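
As a small illustration of the reduction clause, here is a generic sketch (not code from the tutorial repository) that sums a vector; without the reduction clause, the update of sum would need a critical or atomic construct:

#include <stdio.h>

#define SIZE 1000000

static double v[SIZE];

int main(void)
{
	int i;
	double sum = 0.0;

	for (i = 0; i < SIZE; i++)
		v[i] = 1.0;

	/* Each thread accumulates into a private copy of sum; the private
	 * copies are combined with + when the loop finishes. */
	#pragma omp parallel for reduction(+ : sum)
	for (i = 0; i < SIZE; i++)
		sum += v[i];

	printf("sum = %f\n", sum);
	return 0;
}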

One more thing …

False sharing is a significant problem in multi-core programming. Caches do not fetch single variables but whole regions called cache lines, so independent variables used by different cores can end up on the same line. False sharing occurs when a core updates a variable that lives on a cache line currently held by another core’s cache, even though the two cores are working on different variables.

Before we proceed, let’s try to build a basic understanding of multicore cache management. Multicore processors usually feature several levels of cache, with per-core caches and shared caches. On a cache miss, the cache fetches not just the requested memory location but a bigger region around it, called a cache line. At this point, this cache – and thus the corresponding core – becomes the owner of the cache line. When another core’s cache later misses on the same memory location, a cache coherency mechanism is responsible for keeping the system in a consistent state. The exact protocol is architecture dependent, but in essence memory is updated from the previous owner’s cache, the line is invalidated in the first cache, and the line is brought into the new owner’s cache. The update to memory can be skipped if the cache line is not marked dirty, that is, it was not modified after being fetched. The idea that the first core to access a memory location becomes the owner of that location (in the cache) is referred to as first touch.

In the false-sharing directory you can see a demonstration of false sharing. Consider a situation where concurrent threads perform a computation and each one updates its own element of an array. Every time a thread updates its element, the whole cache line containing it has to change owner from core to core. What is demonstrated, then, is a constant ping-pong of cache-line ownership.

The program false-sharing computes the sum of a vector of random numbers. The first part computes the sum sequentially, in order to validate the result. The second part distributes the vector elements among the available threads and each one computes a partial sum. In the first parallel variant the partial sums are written to an array. To remove the false sharing problem, the second variant replaces the partial sums array with thread-private variables; at the end, each thread adds its partial sum to the global sum inside a critical section. The same effect could be achieved with an atomic operation.
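
The two parallel variants could be structured roughly as in the sketch below; names such as partial[] and mysum are illustrative and the repository code may differ:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define SIZE    10000000
#define MAX_THR 64

static double v[SIZE];
static double partial[MAX_THR]; /* adjacent elements share cache lines */

int main(void)
{
	int i;
	double sum1 = 0.0, sum2 = 0.0;

	for (i = 0; i < SIZE; i++)
		v[i] = rand() % 100;

	/* Variant 1: false sharing. Each thread repeatedly updates its own
	 * slot of partial[], but the slots live on the same cache line, so
	 * the line keeps bouncing between the cores. */
	#pragma omp parallel shared(v, partial)
	{
		int tid = omp_get_thread_num();
		int j;

		partial[tid] = 0.0;
		#pragma omp for
		for (j = 0; j < SIZE; j++)
			partial[tid] += v[j];
	}
	for (i = 0; i < omp_get_max_threads(); i++)
		sum1 += partial[i];

	/* Variant 2: no false sharing. Each thread accumulates into a
	 * private variable and adds it to the global sum once, inside a
	 * critical section (an atomic update would also work). */
	#pragma omp parallel shared(v, sum2)
	{
		double mysum = 0.0;
		int j;

		#pragma omp for
		for (j = 0; j < SIZE; j++)
			mysum += v[j];

		#pragma omp critical
		sum2 += mysum;
	}

	printf("sum1 = %f, sum2 = %f\n", sum1, sum2);
	return 0;
}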

$ export OMP_NUM_THREADS=4
$ ./false-sharing 
Parallel sum

Size 10000000
Serial run time: 25.720492 msec
Parallel run time, false sharing: 44.963422 msec
Comparing results: SUCCESS
Parallel run time, no false sharing: 13.556660 msec
Comparing results: SUCCESS

References

For more tutorials, see the following online content.

* http://www.ipd.uni-karlsruhe.de/multicore/research/download/HowToGuide-OpenMP.pdf
* http://www.ichec.ie/Slides/OpenMP/ICHEC-OpenMPCourseSlides.pdf
