Introducing OpenMP: A Portable, Parallel Programming API for Shared Memory Multiprocessors
By Neelakanth Nadgir and Richard Friedman
Sun ONE Studio compilers (C, C++, and Fortran 95) support OpenMP parallelization natively.
OpenMP is an emerging standard model for parallel programming in a shared memory environment.
It provides a set of pragmas that let programmers parallelize their code easily. This article
provides a brief introduction to OpenMP.
This article is of particular interest to programmers
who are new to OpenMP and parallel programming in Fortran, C, or C++.
What Is OpenMP?
OpenMP is a set of specifications and interfaces for parallelizing programs in a shared
memory environment. OpenMP provides a set of pragmas which, when used in a program, direct
an OpenMP-aware compiler to generate an executable that will run over multiple processors
in parallel. No other source code modifications are necessary (other than fine tuning to
get the maximum performance). OpenMP pragmas enable you to use an elegant and uniform
interface to parallelize programs on various architectures and systems. OpenMP is a widely
accepted specification, and vendors like Sun, KAI, and SGI support it. Currently, OpenMP
specifications are available for the Fortran, C, and C++ programming languages. (See the
Related Information box above for a link to the OpenMP website to find the latest OpenMP
specification documents.)
OpenMP takes parallel programming to the next level by creating and synchronizing threads
for you. All you need to do is insert appropriate pragmas in the source program, and then
build the program with a compiler supporting OpenMP. The compiler interprets these pragmas
and parallelizes the code following the pragma. Compilers that are not OpenMP-aware
silently ignore the OpenMP pragmas.
(This article gives examples of using OpenMP with C programs. Equivalent pragmas
exist
for Fortran 95 as well. See the OpenMP User's Guide for
details.)
OpenMP Pragmas
The OpenMP specification defines a set of pragmas. These pragmas are compiler directives
that describe how to process the block of code that follows the pragma. The most basic
pragma is #pragma omp parallel, which denotes a parallel region. The main thread of
execution is called the master thread. When the master thread encounters the parallel
pragma, it creates a team of worker threads, and the work is then distributed among the
workers and the master thread. The environment variable OMP_NUM_THREADS controls the
number of threads in the team (the master thread included). At the end of the parallel
region, all threads wait for each other (the same effect as a barrier pragma), and the
program continues executing sequentially with the master thread.
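To make the team idea concrete, here is a minimal sketch of a parallel region. The function name count_team_members is made up for this illustration; the only OpenMP features it relies on are the parallel and atomic pragmas.

```c
#include <assert.h>

/* Runs one parallel region and returns how many threads executed it.
   count_team_members is a hypothetical name used only in this sketch. */
int count_team_members(void)
{
    int members = 0;   /* declared outside the region, so it is shared */

    #pragma omp parallel
    {
        /* Every thread in the team, master included, runs this block once.
           atomic keeps the concurrent increments from colliding. */
        #pragma omp atomic
        members++;
    }
    /* Implicit barrier at the closing brace: all threads have finished,
       and only the master thread continues from here. */
    assert(members >= 1);
    return members;
}
```

Built with OpenMP enabled (for example -xopenmp with the Sun compilers or -fopenmp with GCC) and run with OMP_NUM_THREADS set to 4, the function typically returns 4. Built without OpenMP, the pragmas are ignored and it returns 1.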
OpenMP supports two basic kinds of parallelism: loops and sections. #pragma omp for is
used for loops. #pragma omp sections encloses a group of sections, and each section inside
it is marked with #pragma omp section. Sections are blocks of code that can be executed in
parallel. These pragmas can be used in a nested fashion. The combined forms
#pragma omp parallel for and #pragma omp parallel sections open a parallel region and
share out the work in a single pragma.
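As an illustration of sections, the sketch below runs two independent scans of the same array as two sections. The function min_and_max and its parameters are assumptions made for this example, not part of any OpenMP interface.

```c
#include <assert.h>

/* Finds the smallest and largest element of v[0..n-1].
   The two scans are independent, so each can run as its own section. */
void min_and_max(const double *v, int n, double *min_out, double *max_out)
{
    double lo, hi;
    assert(n > 0);
    lo = v[0];
    hi = v[0];

    #pragma omp parallel sections
    {
        #pragma omp section
        {
            int i;   /* declared inside the section, so private */
            for (i = 1; i < n; i++)
                if (v[i] < lo) lo = v[i];   /* only this section writes lo */
        }
        #pragma omp section
        {
            int i;
            for (i = 1; i < n; i++)
                if (v[i] > hi) hi = v[i];   /* only this section writes hi */
        }
    }
    *min_out = lo;
    *max_out = hi;
}
```

Each section writes a different shared variable, so no extra synchronization is needed. With only two sections, at most two threads do useful work here.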
#pragma omp master instructs the compiler that the following block of code is to be
executed by the master thread only. #pragma omp barrier instructs all threads to wait for
each other. There is an implicit barrier at the end of a parallel region.
#pragma omp single indicates that only one thread should execute the following block of
code; this thread is not necessarily the master thread. You can protect blocks of code
that are not thread-safe with #pragma omp critical. Of course, all of these make sense
only in the context of a parallel pragma (parallel region).
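The four synchronization pragmas can be seen side by side in a small sketch. The function run_sync_demo and its three out-parameters are hypothetical names chosen for this illustration.

```c
#include <assert.h>

/* Reports how often the single block ran, how often the critical block
   ran, and the count the master thread observed after the barrier. */
void run_sync_demo(int *single_runs, int *critical_runs, int *seen_by_master)
{
    int singles = 0, criticals = 0, seen = 0;   /* all shared */
    assert(single_runs && critical_runs && seen_by_master);

    #pragma omp parallel
    {
        /* Exactly one thread (not necessarily the master) runs this.
           The single construct ends with an implicit barrier. */
        #pragma omp single
        singles++;

        /* Every thread runs this, but only one thread at a time. */
        #pragma omp critical
        criticals++;

        /* Wait until every thread has finished its increment. */
        #pragma omp barrier

        /* Only the master thread reads the final count. */
        #pragma omp master
        seen = criticals;
    }
    *single_runs = singles;
    *critical_runs = criticals;
    *seen_by_master = seen;
}
```

After the call, single_runs is 1 regardless of the team size, while critical_runs and seen_by_master both equal the number of threads in the team.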
A simple matrix multiplication program shows how to use OpenMP to parallelize a program.
Consider the following small code fragment that multiplies two matrices, a and b, into a
third matrix c, which is assumed to be zeroed beforehand. This is a very simple example and,
if you really want a good matrix multiply routine, you will have to consider cache effects,
or use a better algorithm (Strassen's, or Coppersmith and Winograd's, etc.).
    for (ii = 0; ii < nrows; ii++) {
        for (jj = 0; jj < ncols; jj++) {
            for (kk = 0; kk < nrows; kk++) {
                c[ii][jj] += a[ii][kk] * b[kk][jj];
            }
        }
    }
Parallelizing the above code segment is straightforward: insert the #pragma omp parallel for
pragma before the first loop. It is beneficial to place the pragma on the outermost loop,
since that gives the most performance gain. There are no dependencies between iterations of
the outer loop, because each iteration writes a different row of c. The loop variable ii is
made private automatically, but jj and kk are declared outside the loop and would be shared
by default, so they must be listed in a private clause. The preceding code now becomes:
    #pragma omp parallel for private(jj, kk)
    for (ii = 0; ii < nrows; ii++) {
        for (jj = 0; jj < ncols; jj++) {
            for (kk = 0; kk < nrows; kk++) {
                c[ii][jj] += a[ii][kk] * b[kk][jj];
            }
        }
    }
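For readers who want something they can compile and check, here is a self-contained sketch of the parallel multiply. The function multiply, the row-major flat arrays, and the local accumulator acc are choices made for this illustration rather than part of the fragment above.

```c
#include <assert.h>

/* Computes c = a * b, where a is n x n and b is n x m.
   All matrices are stored row-major in flat arrays. */
void multiply(const double *a, const double *b, double *c, int n, int m)
{
    int ii, jj, kk;
    assert(n > 0 && m > 0);

    /* ii is private automatically; jj and kk are declared outside the
       loop, so they must be made private explicitly. */
    #pragma omp parallel for private(jj, kk)
    for (ii = 0; ii < n; ii++) {
        for (jj = 0; jj < m; jj++) {
            double acc = 0.0;   /* declared inside the loop, so private */
            for (kk = 0; kk < n; kk++) {
                acc += a[ii * n + kk] * b[kk * m + jj];
            }
            c[ii * m + jj] = acc;
        }
    }
}
```

A quick way to gain confidence in a parallel loop like this is to run it on a small input whose product is known, once with OMP_NUM_THREADS set to 1 and once with several threads, and compare the results.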
As another example, consider the following code fragment that finds the sum of f(a[x]) for
0 <= x < n.
    for (ii = 0; ii < n; ii++) {
        sum = sum + some_complex_long_function(a[ii]);
    }
To parallelize the above fragment, the first step could be
    #pragma omp parallel for shared(sum) private(value)
    for (ii = 0; ii < n; ii++) {
        value = some_complex_long_function(a[ii]);
        #pragma omp critical
        sum = sum + value;
    }
Here value must be private so that each thread has its own copy. Better still, you can use
the reduction clause, which gives each thread a private copy of sum and combines the copies
at the end of the loop. A variable named in a reduction clause must not also appear in a
private clause:
    #pragma omp parallel for reduction(+: sum)
    for (ii = 0; ii < n; ii++) {
        sum = sum + some_complex_long_function(a[ii]);
    }
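The reduction version can be tried out with a stand-in for the expensive function. In the sketch below, square and sum_of_squares are hypothetical names; square simply plays the role of the long-running function in the fragment above.

```c
#include <assert.h>

/* Stand-in for an expensive, side-effect-free function. */
static double square(double x)
{
    return x * x;
}

/* Sums square(a[ii]) over the array using a reduction. */
double sum_of_squares(const double *a, int n)
{
    double sum = 0.0;
    int ii;
    assert(n >= 0);

    /* Each thread accumulates into its own copy of sum; the copies are
       added into the original variable when the loop ends. */
    #pragma omp parallel for reduction(+: sum)
    for (ii = 0; ii < n; ii++) {
        sum = sum + square(a[ii]);
    }
    return sum;
}
```

Because the partial sums are combined in an unspecified order, floating-point results can differ in the last bits from a serial run. For small integer-valued inputs such as {1, 2, 3, 4} the result is exact.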
OpenMP provides a few runtime environment variables that can be used to control the behavior
of an OpenMP program. The most important and widely used variable is OMP_NUM_THREADS, which
determines the number of threads that make up the team when the master thread encounters a
parallel region. The general rule is to make the number of threads equal to the number of
processors in the system.
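The same rule can be applied from inside the program with the OpenMP runtime library. The sketch below assumes the standard omp.h routines omp_get_num_procs, omp_set_num_threads, and omp_get_max_threads; the wrapper name use_one_thread_per_processor is made up for this example. The _OPENMP guard keeps the code buildable with a compiler that is not OpenMP-aware.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Requests one thread per processor for later parallel regions and
   returns the team size the runtime reports it will use. */
int use_one_thread_per_processor(void)
{
#ifdef _OPENMP
    int nprocs = omp_get_num_procs();
    assert(nprocs >= 1);
    /* Takes precedence over OMP_NUM_THREADS from this point on. */
    omp_set_num_threads(nprocs);
    return omp_get_max_threads();
#else
    /* Without OpenMP the program is serial: a "team" of one. */
    return 1;
#endif
}
```

Setting OMP_NUM_THREADS in the environment is usually the more flexible choice, since it lets you change the thread count without rebuilding.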
How to Begin
There are several ways to parallelize programs. First, determine if you need parallelization.
Sometimes, parallelization requires big machines, and some algorithms are not suitable
for parallelizing. If you are starting a new project, you could choose an algorithm that
can be parallelized. It is very important to be sure that the serial code is correct
before trying to parallelize it. Be sure to record timings of your serial run, so that
you can decide if parallelization is useful.
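One simple way to record such timings is to wrap the code of interest in a small timing helper. The names now_seconds, sample_work, and time_run below are hypothetical; omp_get_wtime is the standard OpenMP wall-clock routine, and the fallback uses the C library clock function.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Returns a timestamp in seconds. */
double now_seconds(void)
{
#ifdef _OPENMP
    return omp_get_wtime();   /* wall-clock time */
#else
    /* CPU time; adequate for a serial build, where nothing runs in parallel. */
    return (double)clock() / CLOCKS_PER_SEC;
#endif
}

/* Placeholder workload: the sum 1/1 + 1/2 + ... + 1/1000000. */
double sample_work(void)
{
    double total = 0.0;
    int i;
    for (i = 1; i <= 1000000; i++)
        total += 1.0 / i;
    return total;
}

/* Runs work once, stores its result, and returns the elapsed seconds. */
double time_run(double (*work)(void), double *result)
{
    double start;
    assert(work != NULL && result != NULL);
    start = now_seconds();
    *result = work();
    return now_seconds() - start;
}
```

Timing the same routine before and after adding pragmas, and comparing the stored results, shows both whether the parallel version is faster and whether it still computes the same answer.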
Compile the serial version with several optimization options. The compiler can generally
perform more low-level optimizations than you can. Try using the automatic parallelization
options of the compiler. Delegating parallelization to the compiler makes it easier for
you to maintain a common source code base. The autoparallelizer can also help you identify
pieces of code that can be parallelized, or point out things in the code that could prevent
parallelization (for example, a function call inside a for loop). To see this information,
compile your program with the -g flag and run the er_src(1) utility in the Sun ONE Studio
Compiler Collection (formerly Forte Developer):
    er_src program_binary_file function_name
Identify bottlenecks in the program using a profiling tool, such as Forte Performance Analyzer
or Rational Quantify. This should help you identify the routines (hot routines) where most
of the time is spent. It is important that this is user CPU time, and not system time,
since system time may be sequential time (for example, two threads trying to read the same
disk segment).
Once you have identified the hot routines, study them to find the loops that do much of the
computation. Try using the -xautopar option of the Forte C compiler to identify
loops that the compiler thinks can be parallelized. Identify shared and private variables by
studying the inter-loop dependencies, then parallelize the loops using OpenMP pragmas. If you
are lucky, they will work fine. If not, try setting OMP_NUM_THREADS to 1 and see if the
correct results are generated. You can also use dbx's runtime checking or tools
like AssureView to find bugs in the program.
Mixing OpenMP With MPI
MPI (Message Passing Interface) is another model for parallel programming. Unlike OpenMP,
MPI spawns multiple processes that then communicate by passing messages, for example over
TCP/IP. Since these processes do not share the same address space, they can run on remote
machines (or a cluster of machines). It is difficult to say whether OpenMP or MPI is better;
each has its advantages and disadvantages. What is more interesting is that OpenMP can be
used with MPI. Typically, you would use MPI to coarsely distribute work among several SMP
machines, and then use OpenMP to parallelize at a finer level. For more information on using
mixed mode MPI/OpenMP, see
Mixed Mode MPI/OpenMP Programming.
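The two levels of decomposition can be sketched without any MPI calls. In the example below, rank and nprocs stand for the values that MPI_Comm_rank and MPI_Comm_size would supply in a real program; partial_sum is a hypothetical function, and the workload (summing the integers 0 to n-1) is only a placeholder.

```c
#include <assert.h>

/* Sums the part of the range 0..n-1 that belongs to one process.
   rank and nprocs are plain parameters here; no MPI call is made. */
long partial_sum(int rank, int nprocs, long n)
{
    long lo, hi, i, sum = 0;
    assert(nprocs > 0 && rank >= 0 && rank < nprocs);

    /* Coarse level: each process owns one contiguous block of the range. */
    lo = rank * n / nprocs;
    hi = (rank + 1) * n / nprocs;

    /* Fine level: the threads of this process share the block. */
    #pragma omp parallel for reduction(+: sum)
    for (i = lo; i < hi; i++) {
        sum += i;
    }
    /* A real mixed-mode program would now combine the per-process
       results, for instance with MPI_Reduce. */
    return sum;
}
```

Calling partial_sum for every rank from 0 to nprocs-1 and adding the results gives the same total as a single serial loop over the whole range, which is a convenient check on the block boundaries.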
Tools for Using OpenMP
For more information on C, C++, and Fortran support for Sun compilers, please see: Sun Studio 10
software.
You can profile your OpenMP programs using Sun Studio 10
software.
Resources